Abstract: In this study, both Bayesian classifiers and mutual-
information classifiers are examined for binary classifications
with or without a reject option. The general decision rules 
in terms of distinctions on error types and reject types are 
derived for Bayesian classifiers. A formal analysis is conducted to 
reveal the parameter redundancy of cost terms when abstaining 
classifications are enforced. The redundancy implies an intrinsic 
problem of "non-consistency" for interpreting cost terms. If no
data is given to the cost terms, we demonstrate the weakness 
of Bayesian classifiers in class-imbalanced classifications. On the 
contrary, mutual-information classifiers are able to provide an 
objective solution from the given data, which shows a reasonable 
balance among error types and reject types. Numerical examples 
of using two types of classifiers are given for confirming the 
theoretical differences, including the extremely-class-imbalanced 
cases. Finally, we briefly summarize the Bayesian classifiers 
and mutual-information classifiers in terms of their application 
advantages, respectively. 

Index Terms — Bayes, entropy, mutual information, error types, 
reject types, abstaining classifier, cost sensitive learning. 



I. Introduction 

The Bayesian principle provides a powerful and formal 
means of dealing with statistical inference in data processing, 
such as classifications. If classifiers are designed based
on this principle, they are called "Bayesian classifiers" in this
work. The learning targets for Bayesian classifiers are either 
the minimum error or the lowest cost. It was recognized that 
Chow [2], [3] was "among the earliest to use Bayesian decision
theory for pattern recognition" [4]. His pioneering work is so
enlightening that its idea of an optimal tradeoff between error and
reject still sheds a bright light that deepens our understanding
of the subject and helps us explore its applications widely in
this information-explosion era. In recent years, cost sensitive 
learning and class-imbalanced learning have received much 
attention in various applications [12-18]. For classifications
of imbalanced, or skewed, datasets, "the ratio of the small to
the large classes can be drastic such as 1 to 100, 1 to 1,000, or
1 to 10,000 (and sometimes even more)" flE\. It was pointed 
out by Yang and Wu [19] that dealing with imbalanced and
cost-sensitive data is among the ten most challenging problems 
in the study of data mining. In fact, the related subjects are not
a new challenge but a more crucial concern than before, given the
increasing need to search for useful information in massive
data. Binary classification is a basic problem in such an
application background. Classifications based on cost terms
for the tradeoff of error types are a conventional subject in
medical diagnosis. Misclassification from "type I error" (or
"false positive") or from "type II error" (or "false negative")
is significantly different in the context of medical practices.
In other domains of applications, one also needs to discern
error types for attaining reasonable results in classifications.
Among all these investigations, cost terms, which are usually
specified by users from a cost matrix, play a key role in class- 
imbalanced learning [1 1-I41 ll20lll46ll|j7l . 

In binary classifications with a reject option, Bayesian 
classifiers require a cost matrix with six cost terms as the 
given data. In addition to the prior probabilities of
classes, this requirement can be another source of subjectivity 
that disqualifies Bayesian classifiers as an objective approach 
of induction [43]. If objectivity is enforced for
classifications with a reject option, a difficulty does exist for
Bayesian classifiers in assigning cost terms objectively. The cost
terms for error types may be given from an application back- 
ground, but are generally unknown for reject types. In binary 
classifications, Chow [3] and early researchers [22], [23], [24]
usually assumed no distinctions among errors and among 
rejects. The later study in |31| considered different costs for 
correct classification and misclassifications, but not for rejects.
The more general settings for distinguishing error types and 
reject types were reported in ll 25l ll27l llSFl . To overcome the 
problems of presetting cost terms manually, Pietraszek [28]
proposed two learning models, namely, "bounded-abstention"
and "bounded-improvement", and Grall-Maës and Beauseroy
[30] applied a strategy of adding performance constraints for
class-selective rejection. If constraints are imposed either on total
reject or on total error, they may result in no distinctions between their
associated cost terms. Up to now, it seems that no study has 
been reported for the objective design of Bayesian classifiers 
by distinguishing error types and reject types at the same time. 

Several investigations are reported by following Chow's rule 
on classifier designs with a reject option [21-30]. In addition to 
a kind of "ambiguity reject" studied by Chow, the other kind 
of "distance reject" was also considered in [21]. Ambiguity 
reject is made to a pattern located in an ambiguous region 
between/among classes. Distance reject represents a pattern 
far away from the means of any class and is conventionally 
called an "outlier" in statistics [4]. Ha [22] proposed another
important kind of reject, called "class-selective reject", which 
defines a subset of classes. This scheme is more suitable 
to multiple-class classifications. For example, in three-class 
problems, Ha's classifiers will output the predictions including 
"ambiguity reject between Class 1 and 2", "ambiguity reject 
among Class 1, 2 and 3", and the other rejects from class
combinations. Multiple rejects with such distinctions will be
more informative than a single "ambiguity reject". Among all
these investigations, the Bayesian principle is again applied
as the design guideline for classifiers.

While the Bayesian inference principle is widely applied 
in classifications, another principle based on the mutual in- 
formation concept is rarely adopted for designing classifiers. 
Mutual information is one of the important definitions in 
entropy theory [38]. Entropy is considered as a measure of
uncertainty within random variables, and mutual information
describes the relative entropy between two random variables
[9]. If classifiers seek to maximize the relative entropy as
their learning target, we refer to them as "mutual-information
classifiers". It seems that Quinlan [5] was among the earliest
to apply the concept of mutual information (but called "in-
formation gain" in his famous ID3 algorithm) in constructing
the decision tree. Kvalseth [6] and Wickens [7] introduced
the definition of normalized mutual information (NMI) for 
assessing a contingency table, which laid down the foundation 
on the relationship between a confusion matrix and mutual 
information. Being pioneers in using an information-based 
criterion for classifier evaluations, Kononenko and Bratko BTl 
suggested the term "information score" which was equivalent 
to the definition of mutual information. A research team 
led by Principe [HI proposed a general framework, called
"Information Theoretic Learning (ITL)", for designing vari- 
ous learning machines, in which they suggested that mutual 
information, or other information theoretic criteria, can be set 
as an objective function in classifier learning. MacKay [9,
page 533] once showed numerical examples for several given
confusion matrices, and he suggested applying mutual infor-
mation for ranking the classifier examples. Wang and Hu [10]
derived the nonlinear relations between mutual information
and the conventional performance measures, such as accuracy,
precision, recall and F1 measure for binary classifications. In
lim . a general formula for normalized mutual information was 
established with respect to the confusion matrix for multiple- 
class classifications with/without a reject option, and the ad- 
vantages and limitations of mutual-information classifiers were 
discussed. However, no systematic investigation is reported 
for a theoretical comparison between Bayesian classifiers and 
mutual-information classifiers in the literature. 

This work focuses on exploring the theoretical differences 
between Bayesian classifiers and mutual-information classi- 
fiers in classifications for the settings with/without a reject 
option. In particular, this paper draws heavily on, and conse-
quently extends, Chow's work by distinguishing error types
and reject types. To achieve analytical tractability without 
losing the generality, a strategy of adopting the simplest 
yet most meaningful assumptions to classification problems 
is pursued for investigations. The following assumptions are 
given in the same way as those in the closed-form studies of 
Bayesian classifiers by Chow [3] and Duda et al. [4]:

A1. Classifications are made for two categories (or classes)
    over the feature variables.
A2. All probability distributions of feature variables are
    exactly known.



One may argue that the assumptions above are too
restrictive to offer practical generality in solving real-world
problems. In fact, the power of Bayesian classifiers is not
confined to their exact solutions to the theoretical problems,
but appears in their generic inference principle for guiding
real applications, even under extreme approximations to the
theory. We fully recognize that the assumption of complete
knowledge of the relevant probability distributions may
never hold in real-world problems [31], [33]. The closed-
form solutions of Bayesian classifiers on binary classifications
in [3], [4] have demonstrated useful design guidelines that
are applicable to multiple classes [22]. The author believes
that the analysis based on the assumptions above will provide
sufficient information for revealing the theoretical differences
between Bayesian classifiers and mutual-information classi-
fiers, while the intended simplifications will help readers
reach a better, or deeper, understanding of the advantages and
limitations of each type of classifier.

The contributions of this work are twofold. First, the analyt- 
ical formulas for Bayesian classifiers and mutual-information 
classifiers are derived to include the general cases with dis- 
tinctions among error types and reject types for cost sen- 
sitive learning in classifications. Second, comparisons are 
conducted between the two types of classifiers for revealing 
their similarities and differences. Specific efforts are made on a
formal analysis of parameter redundancy of the cost terms for
Bayesian classifiers when a reject option is applied. Section II
presents a general decision rule of Bayesian classifiers with
or without a reject option. Section III provides the basic
formulas for mutual-information classifiers. Section IV inves-
tigates the similarities and differences between the two types of
classifiers, and numerical examples are given to highlight the
distinct features in their applications. The question presented
in the title of the paper is answered briefly in
Section V.

II. Bayesian Classifiers with A Reject Option 

A. General Decision Rule for Bayesian Classifiers 

Let $x$ be a random pattern satisfying $x \in X \subset R^d$, which
is in a $d$-dimensional feature space and will be classified. The
true (or target) state $t$ of $x$ is within the finite set of two
classes, $t \in T = \{t_1, t_2\}$, and the possible decision output
$y = f(x)$ is within three classes, $y \in Y = \{y_1, y_2, y_3\}$,
where $f$ is a function for classifications and $y_3$ represents a
"reject" class. Let $p(t_i)$ be the prior probability of class $t_i$
and $p(x|t_i)$ be the conditional probability density function of
$x$ given that it belongs to class $t_i$. The posterior probability
$p(t_i|x)$ is calculated through the Bayes formula [4]:

$$p(t_i|x) = \frac{p(x|t_i)\,p(t_i)}{p(x)}, \tag{1}$$

where $p(x)$ represents the mixture density for normalizing the
probability. Based on the posterior probability, the Bayesian
rule assigns a pattern $x$ into the class that has the highest
posterior probability. Chow [2], [3] first introduced the frame-
work of the Bayesian decision theory into the study of pattern
recognition and derived the best error-type trade-off formulas
and the related optimal reject rule. The purpose of the reject
rule is to minimize the total risk (or cost) in classifications. 
Suppose $\lambda_{ij}$ is a cost term for the true class of a pattern to be
$t_i$, but decided as $y_j$. Then, the conditional risk for classifying
a particular $x$ into $y_j$ is defined as:

$$\mathrm{Risk}(y_j|x) = \sum_{i=1}^{2} \lambda_{ij}\, p(t_i|x) = \sum_{i=1}^{2} \lambda_{ij}\, \frac{p(x|t_i)\,p(t_i)}{p(x)}, \quad j = 1, 2, 3. \tag{2}$$

Note that the definition of $\lambda_{ij}$ in this work is slightly different
from that in [4], so that $\lambda_{ij}$ forms a $2 \times 3$ matrix. Chow
[3] assumed the initial constraints on $\lambda_{ij}$ from intuition in
classifications:

$$\lambda_{ik} > \lambda_{i3} > \lambda_{ii} \geq 0, \quad i \neq k, \quad i = 1, 2, \quad k = 1, 2. \tag{3}$$

The constraints imply that a misclassification will suffer a
higher cost than a rejection, and a rejection will cost more
than a correct classification. Relations about $\lambda_{ij}$ are the main
concern in the study of cost-sensitive learning, and this issue
will be addressed later in this work. The total risk for the
decision output $y$ will be [4]:

$$\mathrm{Risk}(y) = \int_V \sum_{j=1}^{3} \sum_{i=1}^{2} \lambda_{ij}\, p(t_i|x)\, p(x)\, dx, \tag{4}$$

with integration over the entire observation space $V$.

Definition 1 (Bayesian classifier): If a classifier is deter-
mined from the minimization of its risk over all patterns:

$$y^* = \arg\min_{y} \mathrm{Risk}(y), \tag{5a}$$

or, in another form on a given pattern $x$:

$$\text{Decide } y_j \text{ if } \mathrm{Risk}(y_j|x) = \min_{i} \mathrm{Risk}(y_i|x), \tag{5b}$$

this classifier is called a "Bayesian classifier", or "Chow's
abstaining classifier" [27]. The term $\mathrm{Risk}(y^*)$ is usually
called "Bayesian risk", or "Bayesian error" in the cases that
zero-one cost terms ($\lambda_{11} = \lambda_{22} = 0$, $\lambda_{12} = \lambda_{21} = 1$) are used
for no rejection classifications [4].
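
As a minimal sketch of eqs. (2) and (5b), the following Python fragment evaluates the three conditional risks for a 2 x 3 cost matrix and picks the output with the smallest risk. The function names and the example cost values are illustrative assumptions and do not come from the original work.

```python
def conditional_risks(post_t1, cost):
    """Return [Risk(y1|x), Risk(y2|x), Risk(y3|x)] following eq. (2).

    post_t1: posterior p(t1|x); p(t2|x) = 1 - post_t1 in the binary case.
    cost:    2x3 nested list with cost[i][j] = lambda_{(i+1)(j+1)}.
    """
    post = (post_t1, 1.0 - post_t1)
    return [sum(cost[i][j] * post[i] for i in range(2)) for j in range(3)]


def bayes_decide(post_t1, cost):
    """Return 1, 2 or 3 (3 = reject): the minimum-risk output of eq. (5b)."""
    risks = conditional_risks(post_t1, cost)
    return 1 + min(range(3), key=lambda j: risks[j])


# Hypothetical cost matrix obeying eq. (3): error > reject > correct.
COST = [[0.0, 1.0, 0.2],   # true class t1: lambda_11, lambda_12, lambda_13
        [1.0, 0.0, 0.2]]   # true class t2: lambda_21, lambda_22, lambda_23

decisions = [bayes_decide(p, COST) for p in (0.95, 0.5, 0.05)]
print(decisions)  # [1, 3, 2]
```

With these assumed costs, a confident posterior leads to a class decision, while the ambiguous posterior of 0.5 is rejected because the reject risk (0.2) is below both error risks (0.5).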

In [3], a single threshold for a reject option was investigated.
This setting was obtained under the assumption that cost terms
are applied without distinction among the errors and among
rejects. Following Chow's approach but extending it to
general cost terms, one is able to derive the general
decision rule on the rejection for Bayesian classifiers.

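Theorem 1: The general decision rule for Bayesian classifiers
is:

$$\text{Decide } y_1 \text{ if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} > \delta_1, \qquad
\begin{cases}
\text{No rejection: } \delta_1 = \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}, \\[2mm]
\text{Rejection: } \delta_1 = \dfrac{\lambda_{21} - \lambda_{23}}{\lambda_{13} - \lambda_{11}},
\end{cases} \tag{6a}$$

$$\text{Decide } y_2 \text{ if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} < \delta_2, \qquad
\begin{cases}
\text{No rejection: } \delta_2 = \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}, \\[2mm]
\text{Rejection: } \delta_2 = \dfrac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13}},
\end{cases} \tag{6b}$$

$$\text{Decide } y_3 \text{ if } \frac{T_{r2}}{1 - T_{r2}} = \frac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13}} \leq \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} \leq \frac{\lambda_{21} - \lambda_{23}}{\lambda_{13} - \lambda_{11}} = \frac{1 - T_{r1}}{T_{r1}}, \tag{6c}$$

$$\text{Subject to } 0 < \frac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13}} < \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} < \frac{\lambda_{21} - \lambda_{23}}{\lambda_{13} - \lambda_{11}}, \tag{6d}$$

$$\text{and} \quad
\begin{cases}
\text{No rejection: } T_{r1} = T_{r2} = 0.5, \\
\text{Rejection: } 0 < T_{r1} + T_{r2} \leq 1.
\end{cases} \tag{6e}$$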



Eq. (6c) applies the definition of two thresholds (called "rejection
thresholds" in [3]), $T_{r1}$ and $T_{r2}$.

Proof: See Appendix A. ■ 
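
The full proof is given in Appendix A. As a brief, hedged sketch of where the rejection thresholds come from, one can compare the conditional risks of eq. (2) pairwise, assuming $\lambda_{13} > \lambda_{11}$ and $\lambda_{12} > \lambda_{13}$ as in eq. (3):

```latex
\begin{aligned}
\mathrm{Risk}(y_1|x) < \mathrm{Risk}(y_3|x)
  &\iff \lambda_{11}\,p(t_1|x) + \lambda_{21}\,p(t_2|x)
        < \lambda_{13}\,p(t_1|x) + \lambda_{23}\,p(t_2|x) \\
  &\iff \frac{p(t_1|x)}{p(t_2|x)}
        > \frac{\lambda_{21} - \lambda_{23}}{\lambda_{13} - \lambda_{11}} = \delta_1, \\[1mm]
\mathrm{Risk}(y_2|x) < \mathrm{Risk}(y_3|x)
  &\iff \lambda_{12}\,p(t_1|x) + \lambda_{22}\,p(t_2|x)
        < \lambda_{13}\,p(t_1|x) + \lambda_{23}\,p(t_2|x) \\
  &\iff \frac{p(t_1|x)}{p(t_2|x)}
        < \frac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13}} = \delta_2 .
\end{aligned}
```

Since $p(t_1|x)/p(t_2|x) = p(x|t_1)p(t_1)/\bigl(p(x|t_2)p(t_2)\bigr)$, these are the "Rejection" forms of the thresholds. Substituting $p(t_2|x) = 1 - p(t_1|x)$ into the first inequality gives $p(t_1|x) > \delta_1/(1 + \delta_1) = 1 - T_{r1}$ with $T_{r1} = 1/(1 + \delta_1)$.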
Note that eq. (6d) suggests general constraints over $\lambda_{ij}$. The
necessity for having such constraints is explained in Appendix
A. A graphical interpretation of the two thresholds is illustrated
in Fig. 1. Based on eq. (6c), the thresholds can be calculated
from the following formulas:



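$$T_{r1} = \frac{\lambda_{13} - \lambda_{11}}{\lambda_{13} - \lambda_{11} + \lambda_{21} - \lambda_{23}}, \qquad
T_{r2} = \frac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13} + \lambda_{23} - \lambda_{22}}. \tag{7}$$

Eq. (7) describes general relations between thresholds and cost
terms on binary classifications, which enables the classifiers to
make the distinctions among errors and among rejects. Note
that the special settings of Chow's rules [3] can be derived
from eq. (7):

$$\lambda_{11} = \lambda_{22} = 0, \quad \lambda_{12} = \lambda_{21} = 1, \quad \lambda_{13} = \lambda_{23} = T_r. \tag{8}$$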



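Another important relation in [28] can also be obtained:

$$\lambda_{11} = \lambda_{22} = 0, \qquad
0 < \lambda_r = \lambda_{13} = \lambda_{23} < \frac{\lambda_{12}\lambda_{21}}{\lambda_{12} + \lambda_{21}}, \qquad
T_{r1} = \frac{\lambda_r}{\lambda_{21}} \text{ and } T_{r2} = \frac{\lambda_r}{\lambda_{12}}. \tag{9}$$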



Pietraszek [28] derived the rational region of $\lambda_r$ above through
ROC curves. The error costs can be different but not the
reject ones. Note, however, that the rejection thresholds will
be different when $\lambda_{12} \neq \lambda_{21}$. For advanced applications, Van-
derlooy et al. [29] generalized Chow's rules by distinguishing
error types and reject types, and derived the relations between
two "likelihood ratio thresholds" and cost terms. Their rules,
which omit the terms $\lambda_{11}$ and $\lambda_{22}$, are not theoretically general,
yet sufficient for applications. They derived formulas only
from the inequality constraints of $\mathrm{Risk}(y_1|x) > \mathrm{Risk}(y_3|x)$
and $\mathrm{Risk}(y_2|x) > \mathrm{Risk}(y_3|x)$, respectively. Up to now, it
seems no one has reported the general constraints (6d) in the
literature. Based on eq. (6d), one can derive the constraints of
eq. (3) rationally, rather than relying on intuition.
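
The relation between cost terms and rejection thresholds in eq. (7) can be checked numerically. The sketch below is illustrative only: the function name and the numeric cost values are assumptions chosen to reproduce the special settings of eqs. (8) and (9).

```python
def rejection_thresholds(cost):
    """Compute (T_r1, T_r2) from a 2x3 cost matrix following eq. (7).

    cost[i][j] = lambda_{(i+1)(j+1)}; the third column holds reject costs.
    """
    (l11, l12, l13), (l21, l22, l23) = cost
    t_r1 = (l13 - l11) / (l13 - l11 + l21 - l23)
    t_r2 = (l23 - l22) / (l12 - l13 + l23 - l22)
    return t_r1, t_r2


# Chow's setting, eq. (8), with an assumed T_r = 0.3.
chow = [[0.0, 1.0, 0.3], [1.0, 0.0, 0.3]]

# Setting of eq. (9) with assumed lambda_12 = 2, lambda_21 = 4, lambda_r = 1
# (lambda_r < 2*4/(2+4) = 4/3, so the constraint in eq. (9) holds).
common_reject = [[0.0, 2.0, 1.0], [4.0, 0.0, 1.0]]

chow_t = rejection_thresholds(chow)
common_t = rejection_thresholds(common_reject)
print(chow_t, common_t)
```

In the second setting the reject costs are equal, yet the two thresholds differ ($T_{r1} = \lambda_r/\lambda_{21}$, $T_{r2} = \lambda_r/\lambda_{12}$) because the error costs differ.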

Fig. 1. Rejection scenarios from the plots of $p(t_i|x)$ for univariate Gaussian distributions: (a) one cross-over point without rejection; (b) two cross-over points without rejection; (c) one cross-over point with rejection; (d) two cross-over points with rejection.

By applying eq. (1) and the constraint $p(t_1|x) + p(t_2|x) = 1$,
one can achieve the decision rules from eq. (6) with respect
to the posterior probabilities and thresholds in a simple and
better form for abstaining classifiers:



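$$\begin{aligned}
&\text{Decide } y_1 \text{ if } p(t_1|x) > 1 - T_{r1}, \\
&\text{Decide } y_2 \text{ if } p(t_2|x) > 1 - T_{r2}, \\
&\text{Decide } y_3 \text{ otherwise}, \\
&\text{Subject to } 0 < T_{r1} + T_{r2} \leq 1.
\end{aligned} \tag{10}$$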



In comparison with the decision rules of eq. (6), which are
expressed in terms of the likelihood ratio, eq. (10) together
with Fig. 1 presents a better view for users to understand ab-
staining Bayesian classifiers. A plot of posterior probabilities
shows advantages over a plot of the likelihood ratio (Figure
2.3 in [4]) for determining rejection thresholds. Note that in
Fig. 1 the plots are depicted on a one-dimensional variable for
Gaussian distributions of $x$. The simplification supports the
suggestion by Duda et al. that one "should not obscure the
central points illustrated in our simple example" [4]. Two sets
of geometric points are shown for the plots. One set is called
"cross-over points", denoted by $x_c$, which are formed from
the two curves of $p(t_1|x)$ and $p(t_2|x)$. The other is termed
"boundary points", denoted by $x_b$. The boundary points par-
tition classification regions for one-dimensional problems. For
a "no rejection" case, the boundary points are controlled by the
ratio $(\lambda_{21} - \lambda_{22})/(\lambda_{12} - \lambda_{11})$. In abstaining classifications,
those points are determined from two thresholds, respectively.
For multi-dimensional problems, both types of points above
become curves or even hypersurfaces.
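
The posterior-threshold rule of eq. (10) is simple enough to state directly in code. The following is a minimal sketch; the function name and the threshold values are illustrative assumptions.

```python
def abstaining_decision(post_t1, t_r1, t_r2):
    """Apply the rule of eq. (10); returns 1, 2 or 3 (3 = reject).

    post_t1: posterior p(t1|x); p(t2|x) is taken as 1 - post_t1.
    """
    if not 0.0 < t_r1 + t_r2 <= 1.0:
        raise ValueError("thresholds must satisfy 0 < T_r1 + T_r2 <= 1")
    post_t2 = 1.0 - post_t1
    if post_t1 > 1.0 - t_r1:
        return 1
    if post_t2 > 1.0 - t_r2:
        return 2
    return 3


# Assumed thresholds: accept class 1 above 0.8, class 2 above 0.7.
outputs = [abstaining_decision(p, 0.2, 0.3) for p in (0.9, 0.5, 0.1)]
print(outputs)  # [1, 3, 2]
```

The constraint on the thresholds is what keeps the rule unambiguous: both acceptance conditions can hold at once only if $T_{r1} + T_{r2} > 1$, which the check excludes.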



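With the exact knowledge of $p(t_i)$, $p(x|t_i)$, and $\lambda_{ij}$, one can
calculate Bayesian risk from the following equation:

$$\begin{aligned}
\mathrm{Risk}(y^*) &= \lambda_{11} CR_1 + \lambda_{12} E_1 + \lambda_{13} Rej_1 + \lambda_{22} CR_2 + \lambda_{21} E_2 + \lambda_{23} Rej_2 \\
&= \lambda_{11} \int_{R_1} p(t_1)\,p(x|t_1)\,dx + \lambda_{12} \int_{R_2} p(t_1)\,p(x|t_1)\,dx + \lambda_{13} \int_{R_3} p(t_1)\,p(x|t_1)\,dx \\
&\quad + \lambda_{21} \int_{R_1} p(t_2)\,p(x|t_2)\,dx + \lambda_{22} \int_{R_2} p(t_2)\,p(x|t_2)\,dx + \lambda_{23} \int_{R_3} p(t_2)\,p(x|t_2)\,dx,
\end{aligned} \tag{11}$$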

where $CR_i$, $E_i$ and $Rej_i$ are the probabilities of "Correct
Recognition", "Error", and "Rejection" for the $i$th class in the
classifications, respectively; and $R_1$ to $R_3$ are the classification
regions of Class 1, Class 2 and the reject class, respectively.
The general relations among $CR_i$, $E_i$ and $Rej_i$ for binary
classifications are given by [3]:

$$CR_1 + CR_2 + E_1 + E_2 + Rej_1 + Rej_2 = CR + E + Rej = 1, \quad \text{and} \quad A = \frac{CR}{CR + E}, \tag{12}$$

where $CR$, $E$, and $Rej$ represent total correct recognition,
total error and total reject rates, respectively; and $A$ is the
accuracy rate of classifications.
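
To make eqs. (11) and (12) concrete, the sketch below evaluates the six rates for univariate Gaussian classes from the Gaussian cumulative distribution function. It assumes the simplest region layout, $R_1 = (-\infty, b_{low})$, $R_3 = [b_{low}, b_{high}]$ and $R_2 = (b_{high}, \infty)$; the function names, the boundary values and the cost matrix are illustrative assumptions.

```python
import math


def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def class_rates(prior1, mu1, sigma1, mu2, sigma2, b_low, b_high):
    """Return (CR1, E1, Rej1, CR2, E2, Rej2) for the assumed regions
    R1 = (-inf, b_low), R3 = [b_low, b_high], R2 = (b_high, inf)."""
    prior2 = 1.0 - prior1
    f1_low, f1_high = norm_cdf(b_low, mu1, sigma1), norm_cdf(b_high, mu1, sigma1)
    f2_low, f2_high = norm_cdf(b_low, mu2, sigma2), norm_cdf(b_high, mu2, sigma2)
    cr1, rej1, e1 = prior1 * f1_low, prior1 * (f1_high - f1_low), prior1 * (1.0 - f1_high)
    e2, rej2, cr2 = prior2 * f2_low, prior2 * (f2_high - f2_low), prior2 * (1.0 - f2_high)
    return cr1, e1, rej1, cr2, e2, rej2


def bayes_risk(rates, cost):
    """Total risk of eq. (11) for a 2x3 cost matrix cost[i][j]."""
    cr1, e1, rej1, cr2, e2, rej2 = rates
    return (cost[0][0] * cr1 + cost[0][1] * e1 + cost[0][2] * rej1
            + cost[1][1] * cr2 + cost[1][0] * e2 + cost[1][2] * rej2)


# Balanced classes N(-1, 1) and N(1, 1), rejecting on [-0.5, 0.5].
rates = class_rates(0.5, -1.0, 1.0, 1.0, 1.0, -0.5, 0.5)
cr, err, rej = rates[0] + rates[3], rates[1] + rates[4], rates[2] + rates[5]
accuracy = cr / (cr + err)  # A in eq. (12)
risk = bayes_risk(rates, [[0.0, 1.0, 0.2], [1.0, 0.0, 0.2]])
print("CR=%.4f E=%.4f Rej=%.4f A=%.4f Risk=%.4f" % (cr, err, rej, accuracy, risk))
```

With zero cost for correct decisions, unit cost for errors and a common reject cost, the risk reduces to $E + 0.2\,Rej$, which gives a quick consistency check on the computed rates.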

B. Parameter Redundancy Analysis of Cost Terms 

Bayesian classifiers present one of the general tools for cost
sensitive learning. From this perspective, there exists a need
for a systematic investigation into a parameter redundancy
analysis of cost terms for Bayesian classifiers, which appears
missing for a reject option. This section will attempt to develop
a theoretical analysis of parameter redundancy for cost terms.

For Bayesian classifiers, when all cost terms are given along
with the other relevant knowledge about classes, a unique
set of solutions will be obtained. However, this phenomenon
does not indicate that all cost terms will be independent for
determining the final results of Bayesian classifiers. In the
following, a parameter dependency analysis is conducted be-
cause it suggests a theoretical basis for a better understanding
of relations among the cost terms and the outputs of Bayesian
classifiers. Based on [35], [36], we present the relevant defini-
tions but derive a theorem from the functionals in eqs. (4) and
(5) so that it holds generality for any distributions of features.
Let a parameter vector be defined as $\theta = \{\theta_1, \theta_2, \ldots, \theta_p\} \in S$,
where $p$ is the total number of parameters in a model $f(x, \theta)$
and $S$ denotes the parameter space.

Definition 2 (Parameter redundancy [35]): A model
$f(x, \theta)$ is considered to be parameter redundant if it can
be expressed in terms of a smaller sized parameter vector
$\beta = \{\beta_1, \beta_2, \ldots, \beta_q\} \in S$, where $q < p$.

Definition 3 (Independent parameters): A model $f(x, \beta)$ is
said to be governed by independent parameters if it can be
expressed in terms of the smallest size of parameter vector
$\beta = \{\beta_1, \beta_2, \ldots, \beta_m\} \in S$. Let $N_{IP}(\beta)$ denote the total number
($= m$) of $\beta$ for the model $f(x, \beta)$.

Definition 4 (Function of parameters, parameter com-
position, input parameters, intermediate parameters): Sup-
pose three sets of parameter vectors are denoted by
$\theta = \{\theta_1, \theta_2, \ldots, \theta_p\} \in S_1$, $\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_q\} \in S_2$, and
$\eta = \{\eta_1, \eta_2, \ldots, \eta_r\} \in S_3$. If for a model there exists
$f(x, \theta) = f(x, \varphi(\psi(\theta)))$ for $\psi: S_1 \to S_2$ and $\varphi: S_2 \to S_3$, we
call $\varphi$ and $\psi$ functions of parameters, and $\varphi(\psi(\theta))$ a
parameter composition, where $\theta_i$ are called input parameters
for $f(x, \varphi(\psi(\theta)))$, and $\gamma_j$ and $\eta_k$ are intermediate parameters.

Lemma 1: Suppose a model holds the relation $f(x, \theta) = f(x, \varphi(\psi(\theta)))$
for Definition 4. The total number of independent parameters of $\theta$,
denoted as $N_{IP}(f, \theta)$ for the model $f$, will be no more than
$\min(p, q, r)$, or in a form of:

$$N_{IP}(f, \theta) \leq \min(p, q, r). \tag{13}$$

Proof: Suppose $f(x, \theta = \{\theta_1, \theta_2, \ldots, \theta_p\})$ is given without
parameter composition; one can prove that $N_{IP}(f, \theta) \leq p$.
According to Definition 2, any increase of the size of $\theta$
over $p$ will produce a parameter redundancy in the model.
Definition 3 indicates that the vector size $p$ will be an upper
bound for $N_{IP}(f, \theta)$ in this situation. By the same principle,
after parameter compositions are defined in Definition 4 for
$f(x, \theta) = f(x, \varphi(\psi(\theta)))$, the lowest parameter size within $\theta$,
$\psi$ and $\varphi$ will be the upper bound of $N_{IP}(f, \theta)$. ■

For Bayesian classifiers defined by eq. (5a), one can rewrite
it in a form of:

$$y^* = \arg\min_{y} \mathrm{Risk}(y, \{\theta_\lambda, \theta_C\}), \tag{14}$$

where $\theta_\lambda = (\lambda_{11}, \lambda_{12}, \lambda_{13}, \lambda_{21}, \lambda_{22}, \lambda_{23})$ and
$\theta_C = (p(t_1), p(t_2), p(x|t_1), p(x|t_2))$ in binary classifications, with
$\theta_\lambda \cap \theta_C = \emptyset$ for their disjoint sets. Let $E$ (or $Rej$) be the
total Bayesian error (or reject) in binary classifications:

$$\begin{aligned}
E(y^*, \theta) &= E_1 + E_2 = \int_{R_2} p(t_1)\,p(x|t_1)\,dx + \int_{R_1} p(t_2)\,p(x|t_2)\,dx, \\
Rej(y^*, \theta) &= Rej_1 + Rej_2 = \int_{R_3} p(t_1)\,p(x|t_1)\,dx + \int_{R_3} p(t_2)\,p(x|t_2)\,dx.
\end{aligned} \tag{15}$$

Based on eqs. (7) and (12), the total error (or reject) of
Bayesian classifiers defined by eq. (15) shows a form of
composition of parameters:

$$\begin{aligned}
E(y^*, \{\theta_\lambda, \theta_C\}) &= E(y^*, \{x_b(T_r(\theta_\lambda)), \theta_C\}), \\
Rej(y^*, \{\theta_\lambda, \theta_C\}) &= Rej(y^*, \{x_b(T_r(\theta_\lambda)), \theta_C\}),
\end{aligned} \tag{16}$$

where $x_b$ and $T_r$ are two functions of the parameters. $\lambda_{ij}$
($i = 1, 2$, $j = 1, 2, 3$) are usually input parameters, but $T_{rk}$
($k = 1, 2$) can serve as either intermediate parameters or input ones.

Theorem 2: In abstaining binary classifications, the total
number of independent parameters within the cost terms
for defining Bayesian classifiers, $y^*$, should be at most two
($N_{IP}(y^*, \theta_\lambda) \leq 2$). Therefore, applications of cost terms
of $\theta_\lambda = (\lambda_{11}, \lambda_{12}, \lambda_{13}, \lambda_{21}, \lambda_{22}, \lambda_{23})$ in the traditional cost
sensitive learning will exhibit a parameter redundancy for
calculating Bayesian $E(y^*)$ and $Rej(y^*)$ even after assuming
$\lambda_{11} = \lambda_{22} = 0$ and $\lambda_{12} = 1$ as the conventional way in
classifications [3], [27].

Proof: Applying eqs. (14) and (16) in Lemma 1, one can have
$N_{IP}(y^*, \theta_\lambda) \leq \min(p = 6, q = 2, r = 4) = 2$ for defining
Bayesian classifiers from $\theta_\lambda$. However, when imposing three
constraints on $\lambda_{11} = \lambda_{22} = 0$ and $\lambda_{12} = 1$, $\theta_\lambda$ will provide
three free parameters in the cost matrix in a form of:

$$\begin{aligned}
\lambda_{21} &= \lambda_{21}, \\
\lambda_{13} &= \frac{T_{r1}\,(T_{r2}\lambda_{21} + T_{r2} - \lambda_{21})}{T_{r1} + T_{r2} - 1}, \\
\lambda_{23} &= \frac{T_{r2}\,(T_{r1}\lambda_{21} + T_{r1} - 1)}{T_{r1} + T_{r2} - 1},
\end{aligned} \tag{17}$$

which implies a parameter redundancy for calculating
Bayesian $E(y^*)$ and $Rej(y^*)$. ■
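
The redundancy stated in Theorem 2 can be illustrated numerically. The sketch below solves eq. (7) for $\lambda_{13}$ and $\lambda_{23}$ under the assumptions $\lambda_{11} = \lambda_{22} = 0$ and $\lambda_{12} = 1$, and then maps the result back to thresholds. The function names and the values chosen for $T_{r1}$, $T_{r2}$ and $\lambda_{21}$ are arbitrary illustrations.

```python
def thresholds_from_costs(l13, l23, l21):
    """Eq. (7) with lambda_11 = lambda_22 = 0 and lambda_12 = 1."""
    t_r1 = l13 / (l13 + l21 - l23)
    t_r2 = l23 / (1.0 - l13 + l23)
    return t_r1, t_r2


def costs_from_thresholds(t_r1, t_r2, l21):
    """Solve eq. (7) for (lambda_13, lambda_23) under the same assumptions.

    Requires t_r1 + t_r2 != 1, i.e. a genuine reject option.
    """
    d = t_r1 + t_r2 - 1.0
    l13 = t_r1 * (t_r2 * l21 + t_r2 - l21) / d
    l23 = t_r2 * (t_r1 * l21 + t_r1 - 1.0) / d
    return l13, l23


# lambda_21 stays free: several cost matrices share the same thresholds.
recovered = []
for l21 in (0.5, 1.0, 2.0, 3.0):
    l13, l23 = costs_from_thresholds(0.2, 0.3, l21)
    recovered.append(thresholds_from_costs(l13, l23, l21))
    print(l21, round(l13, 4), round(l23, 4), recovered[-1])
```

Each assumed value of $\lambda_{21}$ yields a different pair ($\lambda_{13}$, $\lambda_{23}$), yet all of them map back to the same ($T_{r1}$, $T_{r2}$) and therefore to the same classifier.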
Remark 1: Theorem 2 describes that Bayesian classifiers
with a reject option will suffer a difficulty of uniquely in-
terpreting cost terms. For example, one can even enforce the
following two settings:

$$\lambda_{11} = 0, \; \lambda_{12} = 1, \; 0 < \lambda_{13} < 1, \quad \lambda_{21} = 1, \; \lambda_{22} = 0, \; 0 < \lambda_{23} < 1,$$

$$\lambda_{11} = 0, \; 1 < \lambda_{12}, \; \lambda_{13} = 1, \quad 1 < \lambda_{21}, \; \lambda_{22} = 0, \; \lambda_{23} = 1,$$

for achieving the same Bayesian classifier, as well as their
$E(y^*)$ and $Rej(y^*)$. However, the two sets of settings entail
different meanings and do not show the equivalent relations
except through eq. (7). Hence, confusion may be introduced
when attempting to understand behaviors of error and reject
rates with respect to different sets of cost terms. For this rea-
son, cost terms may present an intrinsic problem for defining
a generic form of settings in cost sensitive learning if a reject
option is enforced.

Remark 2: While Theorem 2 only shows an estimate of the
upper bound of $N_{IP}(y^*, \theta)$ for Bayesian classifiers with a
reject option because a closed-form solution of
$E(y^*, \theta)$ is missing, one can prove $N_{IP}(y^*, \theta) = 1$ for Bayesian
classifiers without rejection. A single independent parameter
from the cost terms can be formed as $(\lambda_{12} - \lambda_{11})/(\lambda_{21} - \lambda_{22})$.

Remark 3: We suggest applying independent parameters for
the design and cost analysis of Bayesian classifiers. The total
number of independent parameters, $N_{IP}(y^*, \theta)$, is change-
able and dependent on the reject option of Bayesian classifiers.
If rejection is not considered, we suggest $\theta_\lambda = (\lambda_{11} = \lambda_{22} = 0,
\lambda_{12} = 1, \lambda_{21} > 0)$ for the cost or error sensitivity analysis.
A single independent cost parameter, $\lambda_{21}$, is capable of gov-
erning the complete behavior of the error rate. For a reject option,
we suggest $\theta_\lambda = (0 < T_{r1}, 0 < T_{r2}, \text{ and } T_{r1} + T_{r2} < 1)$ for
the cost, error, or reject sensitivity analysis, which will lead
to a unique interpretation of the analysis.

C. Examples of Bayesian Classifiers on Univariate Gaussian 
Distributions 

This section will consider abstaining Bayesian classifiers
on Gaussian distributions. As a preliminary study, a univariate
feature in 1D is adopted for the reason of showing theoretical
fundamentals as well as the closed-form solutions. Therefore,
if the relevant knowledge of $p(t_i)$ and $p(x|t_i)$ is given, one
can depict the plots of $p(t_i|x)$ from calculation of eq. (1) (Fig.
1). Moreover, when $\lambda_{ij}$ is known, the classification regions
$R_1$ to $R_3$ in terms of $x_b$ will be fixed for Bayesian
classifiers. After the regions $R_1$ to $R_3$, or $x_b$, are determined,
Bayesian risk will be obtained directly. One can see that these
boundaries can be obtained from the known data of $\delta_i$ when
solving an equality equation on (6a) or (6b):

$$\frac{p(x = x_b|t_1)\,p(t_1)}{p(x = x_b|t_2)\,p(t_2)} = \delta_i. \tag{18}$$

The data of $\delta_i$ can be realized either from cost terms $\lambda_{ij}$, or
from threshold $T_{ri}$ (see eq. (6)). By substituting the exact data
of $p(t_i)$ and $p(x|t_i) \sim N(\mu_i, \sigma_i)$ for Gaussian distributions,
where $\mu_i$ and $\sigma_i$ represent the mean and standard deviation of
the $i$th class, and the data of $\delta_i$ (say, $\delta_1 = (1 - T_{r1})/T_{r1}$
from the given $T_{r1}$) into (18), one can obtain the closed-form
solutions to the boundary points (say, for $x_{b1}$ and $x_{b4}$):

$$x_{b1,4} = \frac{\mu_2\sigma_1^2 - \mu_1\sigma_2^2 \pm \sigma_1\sigma_2\sqrt{\alpha}}{\sigma_1^2 - \sigma_2^2}, \quad \text{if } \sigma_1 \neq \sigma_2, \tag{19a}$$

$$x_{b1} = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_2 - \mu_1} \ln\left(\frac{p(t_1)}{p(t_2)}\,\frac{1}{\delta_1}\right), \quad \text{if } \sigma_1 = \sigma_2 = \sigma, \tag{19b}$$

where $\alpha$ is an intermediate variable defined by:

$$\alpha = (\mu_1 - \mu_2)^2 - (2\sigma_1^2 - 2\sigma_2^2) \ln\left(\frac{p(t_1)\,\sigma_2}{p(t_2)\,\sigma_1}\,\frac{1}{\delta_1}\right). \tag{19c}$$


Eq. (19) is also effective for Bayesian classifiers in the case
of "no rejection". However, only cost terms $\lambda_{ij}$ ($i, j = 1, 2$)
will define the data of $\delta_i$. The general solution to abstain-
ing classifiers has four boundary points by substituting two
thresholds, $T_{r1}$ and $T_{r2}$, respectively. For the conditions shown
in Fig. 1d, $T_{r1}$ will lead to $x_{b1}$ and $x_{b4}$, and $T_{r2}$ to $x_{b2}$ and
$x_{b3}$, respectively. Eq. (19a) shows a general form for achieving
two boundary points from one data point of $\delta_i$, and eq. (19b)
is specific for reaching a single boundary point only when the
standard deviations of two classes are the same. Substituting
the other data of $\delta_2$ into eq. (19) will yield another pair of
data $x_{b2}$ and $x_{b3}$, or a single one $x_{b2}$, in a similar form of eq.
(19).
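
The boundary points can also be obtained by solving eq. (18) directly for Gaussian class densities, which reduces to a quadratic equation in $x$ (or a linear one when $\sigma_1 = \sigma_2$). The following is a sketch under that derivation rather than a transcription of the printed formulas; the function names and all numeric values are illustrative assumptions, and $\mu_1 \neq \mu_2$ is assumed in the equal-variance branch.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def boundary_points(prior1, mu1, sigma1, mu2, sigma2, delta):
    """Real solutions x_b of eq. (18): p(t1)N1(x) / (p(t2)N2(x)) = delta."""
    prior2 = 1.0 - prior1
    log_term = math.log(prior1 * sigma2 / (prior2 * sigma1 * delta))
    if sigma1 == sigma2:
        # Equal variances: a single boundary point (assumes mu1 != mu2).
        return [0.5 * (mu1 + mu2) + sigma1 ** 2 / (mu2 - mu1) * log_term]
    alpha = (mu1 - mu2) ** 2 - 2.0 * (sigma1 ** 2 - sigma2 ** 2) * log_term
    if alpha < 0.0:
        return []  # no real solution: the ratio never reaches delta
    centre = mu2 * sigma1 ** 2 - mu1 * sigma2 ** 2
    spread = sigma1 * sigma2 * math.sqrt(alpha)
    denom = sigma1 ** 2 - sigma2 ** 2
    return sorted([(centre - spread) / denom, (centre + spread) / denom])


def weighted_ratio(x, prior1, mu1, sigma1, mu2, sigma2):
    """Left-hand side of eq. (18) evaluated at x."""
    return (prior1 * gaussian_pdf(x, mu1, sigma1)
            / ((1.0 - prior1) * gaussian_pdf(x, mu2, sigma2)))


T_R1 = 0.3                       # assumed rejection threshold
delta1 = (1.0 - T_R1) / T_R1     # delta_1 = (1 - T_r1) / T_r1
points = boundary_points(0.5, -1.0, 1.0, 1.0, 2.0, delta1)
print(points)
```

Setting `delta` to 1 returns the cross-over point(s) instead, and an empty list signals that no real solution exists for the requested ratio.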

Like the solution for boundary points, cross-over point(s)
can also be obtained from solving eq. (18) or (19) by sub-
stituting $\delta_i = 1$. One can prove that three specific cases will
be met with the cross-over point(s) from the solution of eq.
(18), namely, two, one, or zero cross-over point(s). The case
for the two cross-over points appears only when $\alpha > 0$ in
eq. (19c), and the two curves of $p(t_1|x)$ and $p(t_2|x)$ demon-
strate non-monotonicity (Fig. 1b) through the equality
$p(t_1|x) = 1 - p(t_2|x)$. When the associated standard deviations
are equal for the two classes, i.e., $\sigma_1 = \sigma_2$, only one cross-over
point appears, which corresponds to the monotonic curves of
$p(t_1|x)$ and $p(t_2|x)$ (Fig. 1a). The case for the zero cross-
over point occurs when $\alpha < 0$, which corresponds to no
real-valued (but complex-valued) solution to eq. (19a) and to
situations of non-monotonic curves of $p(t_1|x)$ and $p(t_2|x)$.
In the following, we will discuss several specific cases for
rejections with respect to the cross-over points between the
$p(t_1|x)$ and $p(t_2|x)$ curves, as well as to the associated settings
on $T_r$ and $\lambda_{ij}$. A term is applied to describe every case. For
example, "Case_k_BU" indicates "k" for the $k$th case, "B"
(or "M") for Bayesian (or mutual-information) classifiers, and
"G" (or "U") for Gaussian (or uniform) distributions.

Case_1_BG: No rejection.

For a binary classification, Chow [3] showed that, when
$T_{r1} = T_{r2} \geq 0.5$, there exists no rejection for classifiers.
The novel constraint of $T_{r1} + T_{r2} \leq 1$ shown in eq. (6e)
suggests that the setting should be $T_{r1} = T_{r2} = 0.5$ when
the thresholds are the input data. Users need to specify an
option for "no rejection" or "rejection" as an input. When "no
rejection" is selected, the conventional scheme of cost terms
from a two-by-two matrix will be sufficient. Any usage of a
two-by-three matrix will introduce some confusion that will
be illustrated in the later section by Example 1. In addition,
one cannot consider $\lambda_{13} = \lambda_{23} = 0$ as the defaults for the
cost matrix in this case.

Case_2_BG: Rejection to all or to a complete class.
In discussing this case, we relax the constraints in eq. (6e) for
including the zero values of the thresholds. Chow [3] showed
that, whenever $T_r = 0$, a classifier will reject all patterns.
Substituting zero values for thresholds into eq. (7), one will
obtain solutions for $\lambda_{11} = \lambda_{22} = \lambda_{13} = \lambda_{23} = 0$. These
results imply that no cost is received even for a reject decision
to a pattern. Obviously, a case like this should be avoided.
In some situations, if one intends to reject a complete class
(say, Class 1), its associated cost terms should be set to zero
(say, $\lambda_{11} = \lambda_{13} = 0$). We call these situations "one-class
and reject-class" classification, since only two categories are
identified, that is, "Class 2" and "Reject Class", respectively.
TABLE I
Rejection Settings for Bayesian Classifiers in Univariate Gaussian Distributions
(x_b1 < x_c < x_b2, or x_b1 < x_c1 < x_b2 < x_b3 < x_c2 < x_b4)

Cross-over point(s) (reference figure) | Rejection thresholds | Reject region(s) | Remarks
Two (Fig. 1d) | T_r1 = 0.5, T_r2 = 0.5 | - | No Rejection
Two (Fig. 1d) | T_r1 > 0.5, 1 − max(p(t_2|x)) < T_r2 < 0.5 | [x_c1, x_b2) and (x_b3, x_c2] | -
Two (Fig. 1d) | T_r1 < 0.5, T_r2 > 0.5 | [x_b1, x_c1) and (x_c2, x_b4] | -
Two (Fig. 1d) | T_r1 < 0.5, 1 − max(p(t_2|x)) < T_r2 < 0.5 | [x_b1, x_b2) and (x_b3, x_b4] | General Rejection
Two (Fig. 1d) | T_r1 < 0.5, T_r2 < 1 − max(p(t_2|x)) | [x_b1, x_b4] | "Class-1 and Reject-class" Classification
Two (Fig. 1d) | T_r1 = 0, T_r2 < 1 | (−∞, x_b2) and (x_b3, ∞) | "Class-2 and Reject-class" Classification
One (Fig. 1c) | T_r1 = 0.5, T_r2 = 0.5 | - | No Rejection
One (Fig. 1c) | T_r1 > 0.5, T_r2 < 0.5 | [x_c, x_b2) | -
One (Fig. 1c) | T_r1 < 0.5, T_r2 > 0.5 | [x_b1, x_c) | -
One (Fig. 1c) | T_r1 < 0.5, T_r2 < 0.5 | [x_b1, x_b2) | General Rejection
Zero (Fig. 1d) | T_r1 > 1 − min(p(t_1|x)) | - | "Majority-taking-all" Classification
Zero (Fig. 1d) | T_r1 < 1 − min(p(t_1|x)), T_r2 < 1 − max(p(t_2|x)) | [x_b1, x_b4] | "Majority-class and Reject-class" Classification
Zero (Fig. 1d) | T_r1 < 1 − min(p(t_1|x)), T_r2 > 1 − max(p(t_2|x)) | [x_b1, x_b2) and (x_b3, x_b4] | General Rejection
Zero (Fig. 1d) | T_r1 = 0, T_r2 > 1 − max(p(t_2|x)) > 0.5 | (−∞, x_b2) and (x_b3, ∞) | "Minority-class and Reject-class" Classification
Zero, one and two (Fig. 1) | T_r1 = T_r2 = 0 | (−∞, ∞) | Rejection to All

Case_3_BG: Rejection in two cross-over points x_c1 and x_c2.
The necessary condition for realizing this case is derived from
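eq. (18) for α > 0 while assuming δ_1 = 1:

    \frac{\lambda_{12} - \lambda_{11}}{\lambda_{21} - \lambda_{22}} < \frac{p(t_2)\,\sigma_1}{p(t_1)\,\sigma_2} \exp\left[ \frac{(\mu_1 - \mu_2)^2}{2(\sigma_1^2 - \sigma_2^2)} \right]. \qquad (20)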

The general situation within this case is T_r1 < 0.5
and 1 − max(p(t_2|x)) < T_r2 < 0.5, in which the reject
region R_3 is divided into two ranges. When T_r1 < 0.5 and
T_r2 < 1 − max(p(t_2|x)) < 0.5, only one class is identified, and
all other patterns are assigned to a reject class. Therefore,
we refer to this situation as "Class-1 and Reject-class" classification.
Table I also lists the other situations for rejection
that arise from different settings of T_rj.
Case_4_BG: Rejection in one cross-over point x_c.
In the context of classification, the general condition for realizing
this case does not come from imposing equality in (20), i.e., α = 0.
We neglect that setting here and assign it to Case_5_BG. As
demonstrated in eq. (19b), the general condition of this case is
simply σ_1 = σ_2. Since the curves of p(t_1|x) and p(t_2|x) are
monotonic in this case, a single reject region is formed
(Fig. 1c).

Case_5_BG: Rejection in zero cross-over points.
The general condition for realizing this case corresponds to
a violation of the criterion in (19a), or α < 0 in (20).
In this case, one class always has a higher posterior
probability than the other over the whole domain of x.
Following the definitions used in studies of class-imbalanced
datasets [14][16], if p(t_1) > p(t_2) in a binary
classification, Class 1 is called the "majority" class and
Class 2 the "minority" class. Supposing that p(t_1|x) > p(t_2|x),
when T_r1 > 1 − min(p(t_1|x)), all patterns will be assigned
to Class 1. We call this situation "Majority-taking-all"
classification. Due to constraints such as T_r1 + T_r2 ≤ 1 and
p(t_1|x) + p(t_2|x) = 1, one is unable to realize a
"Minority-taking-all" classification. When T_r1 < 1 − min(p(t_1|x)) and
T_r2 < 1 − max(p(t_2|x)), all patterns will be partitioned
into one of two categories, that is, majority and rejection. We
call this situation "Majority-class and Reject-class" classification.
"Minority-class and Reject-class" classification occurs if
T_r2 > 1 − max(p(t_2|x)) > 0.5 and T_r1 = 0.

Since the study of imbalanced data learning has received
more attention recently [16][17][18], a related theorem
on Bayesian classifiers is derived below to elucidate their
important features.

Theorem 3: Consider a binary classification with exact
knowledge of one-dimensional Gaussian distributions. If a
zero-one cost function is applied, Bayesian classifiers without
rejection satisfy the following rule:

    if p_min = min(p(t_1), p(t_2)) → 0, and
       λ_11 = λ_22 = 0, λ_12 = λ_21 = 1,          (21)
    then E → E_max = p_min,

which indicates that the classifiers tend to reach
the maximum Bayesian error, E_max, by misclassifying all
rare-class patterns in imbalanced data learning.

Proof: We first prove the misclassification of all rare-class
patterns. Suppose that p(t_2) is the prior probability
of the "minority" or "rare" class in imbalanced data learning,
and consider first the special case of equal variances for the
two classes (Fig. 1a). When p(t_2) approaches zero, x_c
approaches infinity according to eq. (19b) with δ_1 = 1. This result
indicates that Bayesian classifiers will assign all patterns to
the "majority" class. When the variances are
not equal, eqs. (19a) and (19c) with δ_1 = 1 are applicable
(Fig. 1b). One obtains the relation α < 0, so that no
cross-over point occurs on the p(t_i|x) plots, when p(t_2) approaches
zero. Only the "majority" class is identified by
Bayesian classifiers in this case. The equality E_max = p_min
gives an upper bound on the Bayesian error (see Appendix
B). An error above this bound cannot occur, because Bayesian
classifiers adjust themselves to achieve the smallest error rate. ■
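
The equal-variance part of the proof can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: it uses the single cross-over point obtained by equating p(t_1)p(x|t_1) and p(t_2)p(x|t_2) for μ_1 < μ_2, hypothetical parameters μ_1 = −1, μ_2 = 1, σ = 1, and a zero-one cost. The function name `bayes_error_equal_variance` is hypothetical.

```python
import math

def bayes_error_equal_variance(mu1, mu2, sigma, p2):
    """Cross-over point and Bayes error for equal-variance Gaussians (mu1 < mu2)."""
    p1 = 1.0 - p2
    # Single cross-over point of the two posterior curves.
    x_c = 0.5 * (mu1 + mu2) - sigma**2 * math.log(p2 / p1) / (mu2 - mu1)
    # E1: Class 1 patterns to the right of x_c; E2: Class 2 patterns to its left.
    e1 = p1 * 0.5 * math.erfc((x_c - mu1) / (sigma * math.sqrt(2.0)))
    e2 = p2 * 0.5 * math.erfc((mu2 - x_c) / (sigma * math.sqrt(2.0)))
    return x_c, e1 + e2

ratios = []
for p2 in (0.5, 1e-1, 1e-2, 1e-4, 1e-6):
    x_c, err = bayes_error_equal_variance(-1.0, 1.0, 1.0, p2)
    ratios.append(err / p2)  # E / p_min, which should approach 1
    print(f"p(t2)={p2:g}  x_c={x_c:.3f}  E={err:.3e}  E/p_min={err / p2:.4f}")
```

As p(t_2) shrinks, x_c moves to the right without bound and the ratio E/p_min rises toward one, which is the behavior stated in Theorem 3.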

D. Examples of Bayesian Classifiers on Univariate Uniform 
Distributions 

Chow [3] presented a study of rejection in Bayesian
classifiers for uniform distributions in one-dimensional
problems. This section extends Chow's results by providing
general formulas for parameterized distributions. A binary
classification is considered, with the following uniform
distributions for the two classes:

    p(x|t_1) = \begin{cases} \dfrac{1}{x_2 - x_1}, & x_1 \le x \le x_2 \\ 0, & \text{otherwise} \end{cases} \qquad (22a)

    p(x|t_2) = \begin{cases} \dfrac{1}{x_4 - x_3}, & x_3 \le x \le x_4 \\ 0, & \text{otherwise} \end{cases} \qquad (22b)

Three specific cases, shown in Fig. 2, arise for eq. (22), namely,
"Partially overlapping", "Fully overlapping by one class", and
"Separating" between the two distributions. We
discuss each case with respect to its rejection settings.
Case_1_BU: Partially overlapping between two distributions.

Suppose that the constraints for this case are:

    x_1 \le x \le x_4, \quad \text{and} \quad x_1 < x_3 < x_2 < x_4. \qquad (23)

When the relevant knowledge of p(t_i) and p(x|t_i) is given,
one obtains the posterior probabilities from eqs. (1)
and (22) in closed form:

    p(t_1|x) = \begin{cases}
        1, & x_1 \le x < x_3 \\
        \dfrac{p(t_1)(x_4 - x_3)}{p(t_1)(x_4 - x_3) + p(t_2)(x_2 - x_1)}, & x_3 \le x < x_2 \\
        0, & \text{otherwise}
    \end{cases} \qquad (24a)

    p(t_2|x) = \begin{cases}
        1, & x_2 \le x \le x_4 \\
        \dfrac{p(t_2)(x_2 - x_1)}{p(t_1)(x_4 - x_3) + p(t_2)(x_2 - x_1)}, & x_3 \le x < x_2 \\
        0, & \text{otherwise}
    \end{cases} \qquad (24b)

Based on the Bayesian rules of eq. (10) and on eq. (24), one
can immediately determine R_1 = [x_1, x_3) and R_2 = [x_2, x_4]
for Class 1 and Class 2, respectively, as shown in
Fig. 2. The remaining range is denoted as R_i = [x_3, x_2),
since its decision needs to be identified further depending on the
rejection thresholds. Due to the simplicity of the uniform
distributions, one is able to obtain analytical solutions directly
for Bayesian classifiers. The probabilities of errors and rejects
are calculated from:

    E = \begin{cases}
        \dfrac{p(t_2)(x_2 - x_3)}{x_4 - x_3}, & \text{if } f(x \in R_i) = y_1 \\
        \dfrac{p(t_1)(x_2 - x_3)}{x_2 - x_1}, & \text{if } f(x \in R_i) = y_2 \\
        0, & \text{if } f(x \in R_i) = y_3
    \end{cases} \qquad (25)

and

    Rej = \begin{cases}
        0, & \text{if } f(x \in R_i) = y_1 \\
        0, & \text{if } f(x \in R_i) = y_2 \\
        \dfrac{p(t_1)(x_2 - x_3)}{x_2 - x_1} + \dfrac{p(t_2)(x_2 - x_3)}{x_4 - x_3}, & \text{if } f(x \in R_i) = y_3
    \end{cases} \qquad (26)

We use f(x ∈ R_i) = y_j to denote the decision that R_i
is assigned to Class j (with y_3 denoting rejection). Eq. (25)
demonstrates that Bayesian classifiers on uniform class
distributions incur errors either from Class 1 or from Class 2,
but not from both. When the cost terms are set properly, zero
error can be achieved by rejecting patterns of both classes, as
shown in eq. (26). It is interesting to observe that the cost terms
can only control the error types or the appearance of rejection,
but not their degree. This is significantly different from Bayesian
classifiers on Gaussian class distributions.
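
The three outcomes of eqs. (25) and (26) can be tabulated with a few lines of code. This is a minimal sketch: the function name `uniform_error_reject` and the sample end points are hypothetical, and the `decision` argument encodes which label (1, 2, or 3 for reject) is given to the overlap range [x_3, x_2).

```python
def uniform_error_reject(x1, x2, x3, x4, p1, p2, decision):
    """Return (E, Rej) when the overlap [x3, x2) is decided as class 1, 2, or reject (3)."""
    if not (x1 < x3 < x2 < x4):
        raise ValueError("expected x1 < x3 < x2 < x4 (partial overlap)")
    mass1 = p1 * (x2 - x3) / (x2 - x1)  # Class 1 probability mass inside the overlap
    mass2 = p2 * (x2 - x3) / (x4 - x3)  # Class 2 probability mass inside the overlap
    if decision == 1:
        return mass2, 0.0   # all overlapping Class 2 patterns become errors
    if decision == 2:
        return mass1, 0.0   # all overlapping Class 1 patterns become errors
    if decision == 3:
        return 0.0, mass1 + mass2  # no error, both masses are rejected
    raise ValueError("decision must be 1, 2, or 3")

for d in (1, 2, 3):
    print(d, uniform_error_reject(0.0, 2.0, 1.0, 4.0, 0.5, 0.5, d))
```

Changing the cost terms can only switch between these three rows. It cannot move the error or reject rate to an intermediate value, which is the point made in the paragraph above.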
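Case_2_BU: Fully overlapping by one class.
The constraints for this case are:

    x_1 \le x \le x_2, \quad \text{and} \quad x_1 < x_3 < x_4 < x_2, \qquad (27)

and the posterior probabilities are:

    p(t_1|x) = \begin{cases}
        1, & x_1 \le x < x_3 \text{ or } x_4 < x \le x_2 \\
        \dfrac{p(t_1)(x_4 - x_3)}{p(t_1)(x_4 - x_3) + p(t_2)(x_2 - x_1)}, & x_3 \le x \le x_4 \\
        0, & \text{otherwise}
    \end{cases} \qquad (28a)

    p(t_2|x) = \begin{cases}
        \dfrac{p(t_2)(x_2 - x_1)}{p(t_1)(x_4 - x_3) + p(t_2)(x_2 - x_1)}, & x_3 \le x \le x_4 \\
        0, & \text{otherwise}
    \end{cases} \qquad (28b)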

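Proceeding as in the previous case, one obtains
the analytical results:

    E = \begin{cases}
        p(t_2), & \text{if } f(x \in R_i) = y_1 \\
        \dfrac{p(t_1)(x_4 - x_3)}{x_2 - x_1}, & \text{if } f(x \in R_i) = y_2 \\
        0, & \text{if } f(x \in R_i) = y_3
    \end{cases} \qquad (29)

and

    Rej = \begin{cases}
        0, & \text{if } f(x \in R_i) = y_1 \\
        0, & \text{if } f(x \in R_i) = y_2 \\
        \dfrac{p(t_1)(x_4 - x_3)}{x_2 - x_1} + p(t_2), & \text{if } f(x \in R_i) = y_3
    \end{cases} \qquad (30)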

Specific solutions are obtained in this case for Class 2,
which is fully overlapped by Class 1. All patterns within
Class 2 may be fully misclassified or fully rejected, depending on
the settings of the cost terms.

Fig. 2. Classification scenarios for univariate uniform distributions. (a) Partially overlapping. (b) Fully overlapping by one class. (c) Separating.

Case_3_BU: Separation between two distributions.

One is able to obtain exact solutions without any error or
rejection. Cost terms have no effect in this case.

III. Mutual-information based Classifiers with A 
Reject Option 

A. Mutual-information based Classifiers 

Definition 5 (Mutual-information classifier): A mutual-information
classifier is the classifier obtained from
the maximization of mutual information over all patterns:

    y^+ = \arg\max_{y} NI(T = t, Y = y), \qquad (31)

where T and Y are the target variable and the decision output
variable, and t and y are their values, respectively. For simplicity,
we denote NI(T = t, Y = y) = NI(T, Y) as the normalized
mutual information in the form of [11]:

    NI(T, Y) = \frac{I(T, Y)}{H(T)}, \qquad (32a)

where H(T) is the entropy of the target variable based on the
Shannon definition [37],

    H(T) = -\sum_{i=1}^{m} p(t_i) \log_2 p(t_i), \qquad (32b)

and I(T, Y) is the mutual information between the two variables
T and Y,

    I(T, Y) = \sum_{i=1}^{m} \sum_{j=1}^{m+1} p(t_i, y_j) \log_2 \frac{p(t_i, y_j)}{p(t_i)\,p(y_j)}, \qquad (32c)

where m is the total number of classes in T. For binary
classifications, we set m = 2. In (32), p(t, y) is the joint
distribution of the two variables, and p(t) and p(y) are
the marginal distributions, which can be derived from [38]:

    p(t) = \sum_{y} p(t, y), \quad \text{and} \quad p(y) = \sum_{t} p(t, y). \qquad (33)

Mathematically, eq. (31) expresses that y^+ is an optimal
classifier in terms of the maximal mutual information, or
relative entropy, between the target variable T and the decision
output variable Y. The physical interpretation of relative
entropy is a measure of probability similarity between the two
variables. Note that the present definition of NI is asymmetric
in the variables T and Y because of the normalization term H(T)
(a constant for given p(t)), but this makes no difference in
arriving at the optimal y^+ defined by (31). We adopt Shannon's
definition of entropy because no free parameter
is introduced. A normalization scheme is applied so that
classifiers can be compared easily on a relative scale.
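
As a concrete reading of eqs. (32) and (33), the sketch below evaluates NI(T, Y) from a joint distribution matrix whose rows are the classes t_i and whose columns are the decisions y_j, with the last column reserved for rejection. The function name `normalized_mutual_information` is hypothetical, base-2 logarithms are an assumption (the base cancels in the ratio), and at least two classes are assumed to have non-zero prior so that H(T) > 0.

```python
import math

def normalized_mutual_information(joint):
    """NI(T, Y) = I(T, Y) / H(T) for a joint matrix with rows t_i and columns y_j."""
    p_t = [sum(row) for row in joint]            # marginal p(t), eq. (33)
    p_y = [sum(col) for col in zip(*joint)]      # marginal p(y), eq. (33)
    h_t = -sum(p * math.log2(p) for p in p_t if p > 0.0)
    i_ty = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # terms with p(t_i, y_j) = 0 contribute nothing
                i_ty += p * math.log2(p / (p_t[i] * p_y[j]))
    return i_ty / h_t

# A perfect classifier on balanced classes, and a classifier that assigns
# every pattern of a 90/10 problem to the majority class.
print(normalized_mutual_information([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]))
print(normalized_mutual_information([[0.9, 0.0, 0.0], [0.1, 0.0, 0.0]]))
```

The second call illustrates why a decision that sends all patterns to one class carries no information about the target, regardless of how small its error rate is.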

Definition 6 (Augmented confusion matrix [11]): An augmented
confusion matrix includes one column for a
rejected class, which is added to a conventional confusion
matrix:

    C = \begin{bmatrix}
        c_{11} & c_{12} & \cdots & c_{1m} & c_{1(m+1)} \\
        c_{21} & c_{22} & \cdots & c_{2m} & c_{2(m+1)} \\
        \vdots & \vdots &        & \vdots & \vdots \\
        c_{m1} & c_{m2} & \cdots & c_{mm} & c_{m(m+1)}
    \end{bmatrix}, \qquad (34)

where c_ij is the number of patterns of the ith class that are
classified as the jth class. The rows correspond to the exact
classes, and the columns correspond to the predicted
classes. The last column represents the reject class. The relations
and constraints of an augmented confusion matrix are:

    C_i = \sum_{j=1}^{m+1} c_{ij}, \quad C_i > 0, \quad c_{ij} \ge 0, \quad i = 1, 2, \ldots, m, \qquad (35)

where C_i is the total number of patterns of the ith class. The data for
C_i are known in classification problems.

In this work, supposing that the prior probability p(t_i) and the
conditional probability density function p(x|t_i) are exactly
known as input data, one is able
to derive the joint distribution matrix in association with the
confusion matrix:

    p_{ij} = p(t_i, y_j) = \int_{R_j} p(t_i)\,p(x|t_i)\,dx \approx \frac{c_{ij}}{n} = p_e(t_i, y_j),
    \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, m+1, \qquad (36)

where R_j is the region in which every pattern x
is identified as the jth class, and p_e(t_i, y_j) is the empirical
probability density for applications where only a confusion
matrix is given. In those applications, the total number of
patterns n is generally known.

Eq. (36) describes the approximate relation between
the joint distribution and the confusion matrix. If
p(t_i) and p(x|t_i) are exactly known, one can design a
mutual-information classifier directly. If no initial information
is known about p(t_i) and p(x|t_i), the empirical probability
density of the joint distribution, p_e(t_i, y_j), can be estimated from
the confusion matrix [11]. This treatment, based on the frequency
principle of a confusion matrix, is not mathematically
rigorous, but offers a simple way for classifiers to
apply the entropy principle in wider applications.

    p(t, y) = \begin{bmatrix}
        \dfrac{p(t_1)}{2}[1 - erf(X_{11})] & \dfrac{p(t_1)}{2}[1 - erf(X_{12})] & \dfrac{p(t_1)}{2}[erf(X_{11}) + erf(X_{12})] \\
        \dfrac{p(t_2)}{2}[1 - erf(X_{21})] & \dfrac{p(t_2)}{2}[1 - erf(X_{22})] & \dfrac{p(t_2)}{2}[erf(X_{21}) + erf(X_{22})]
    \end{bmatrix} \qquad (42a)

Considering binary classifications, one has the following
formula for the joint distribution p(t, y):

    p(t, y) = \begin{bmatrix}
        \int_{R_1} p(t_1)p(x|t_1)\,dx & \int_{R_2} p(t_1)p(x|t_1)\,dx & \int_{R_3} p(t_1)p(x|t_1)\,dx \\
        \int_{R_1} p(t_2)p(x|t_2)\,dx & \int_{R_2} p(t_2)p(x|t_2)\,dx & \int_{R_3} p(t_2)p(x|t_2)\,dx
    \end{bmatrix}. \qquad (37)

The marginal distribution p(t) is in fact the given
prior knowledge about the classes:

    p(t) = (p(t_1), p(t_2))^{T}, \qquad (38)

where the superscript "T" represents a transpose, and the
marginal distribution p(y) is:

    p(y) = (p(y_1), p(y_2), p(y_3)) = \left( \int_{R_1} Q\,dx, \int_{R_2} Q\,dx, \int_{R_3} Q\,dx \right),
    \quad Q = p(t_1)p(x|t_1) + p(t_2)p(x|t_2). \qquad (39)

Substituting (37)-(38) into (32), one obtains the formula
of NI in terms of p(t_i) and p(x|t_i). When the prior knowledge
of p(t_i) is given, the entropy H(T) in eq. (32b)
remains unchanged during classifier learning. This is why we
use this term to normalize the mutual information in (32a).



B. Examples of Mutual-information Classifiers on Univariate 
Gaussian Distributions 

Mutual-information classifiers, like Bayesian classifiers,
also provide a general formulation of classification. They are
able to process classifications with or without rejection. This
section aims at deriving the novel formulas necessary for the
design and analysis of mutual-information classifiers under
the assumption of Gaussian distributions. The assumptions, or
given input data, for the derivations are kept the same as those
for the Bayesian classifiers in Section II, except that the cost
terms λ_ij are not given as input, but are obtained as
output of the classifiers. In other words, mutual-information
classifiers automatically calculate the two thresholds, which
lead to the cost terms through eq. (7). However, due to the
redundancy among the six cost terms, one cannot obtain a
unique solution for the cost terms, as demonstrated in
Example 1 of Section IV.

Generally, one is unable to derive a closed-form solution
for mutual-information classifiers. One of the obstacles is the
nonlinear complexity of solving error functions. Therefore,
this work only provides semi-analytical solutions for
mutual-information classifiers. When substituting p(t_i) and p(x|t_i)
into eqs. (31) and (32), one encounters the
inverse problem of solving the following function:

    \max NI(T, Y) = \max f(x, \theta = (p(t_i), p(x|t_i), x_{bj})), \qquad (40)

for the boundary points x_bj from error functions.
Only numerical solutions can be obtained for x_bj, except
for a special case. Whenever a reject option is set,
mutual-information classifiers generate the classification regions
R_i (i = 1, 2, 3) automatically according to the given data
p(t_i) and p(x|t_i), as shown in Table II. In the following,
some specific cases of mutual-information classifiers are
discussed in relation to the reject option.

Case_1_MG: No rejection in one cross-over point x_c when
p(t_1) = p(t_2) and σ_1 = σ_2.

This is a very special case in which one is able to obtain a
closed-form solution for mutual-information classifiers. Under
the conditions p(t_1) = p(t_2) and σ_1 = σ_2, and with a two-by-two
joint distribution matrix for no rejection, one gets a single
boundary point x_b, coincident with the cross-over point x_c, for
partitioning the classification regions:

    x_b = x_c = \frac{\mu_1 + \mu_2}{2},
    \quad \text{if } \mu_1 < \mu_2 \text{ then } R_1 = (-\infty, x_b), \; R_2 = [x_b, \infty), \; R_3 = \emptyset. \qquad (41)

This result agrees with that of Bayesian classifiers
and leads to the same error values for the two types
of classifiers. Therefore, eq. (41) indicates that y^+ = y^*, i.e.,
mutual-information classifiers and Bayesian classifiers are fully
equivalent under the conditions p(t_1) = p(t_2) and
σ_1 = σ_2 when no reject option is selected.
Case_2_MG: Rejection in one cross-over point x_c and
σ_1 = σ_2.

When we relax the condition of the case above to p(t_1) ≠ p(t_2)
and add a reject option, the solutions of mutual-information
classifiers are no longer fully analytical. The obstacle to
an analytical solution is the determination of
x_bj. In this case, due to the condition σ_1 = σ_2, one
has a single cross-over point x_c, as in the general case of binary
classification for Gaussian distributions. If a reject option is
selected, one generally has two boundary points, x_b1 and
x_b2. Supposing μ_1 < μ_2 and x_b1 < x_b2, one can partition the
classification regions as R_1 = (−∞, x_b1), R_2 = [x_b2, ∞),
and R_3 = [x_b1, x_b2). Supposing that the two boundary points are
given, one has a closed-form formula for eq. (37), given in eq. (42a),
where erf(·) is the error function, and

    X_{ij} = (-1)^{j+1} \frac{\mu_i - x_{bj}}{\sqrt{2}\,\sigma_i}, \quad i = 1, 2, \quad j = 1, 2. \qquad (42b)

TABLE II
Classification Regions for Mutual-information Classifiers in Univariate Gaussian Distributions of Fig. 1 (x_b1 < x_b2 < x_b3 < x_b4)

Reject option | Cross-over point(s) | Boundary point(s) | Class of R_1 | Class of R_2 | Class of R_3
No Rejection | x_c | x_b | (−∞, x_b) | [x_b, ∞) | -
No Rejection | x_c1, x_c2 | x_b1, x_b2 | (−∞, x_b1) and (x_b2, ∞) | [x_b1, x_b2] | -
Rejection | x_c | x_b1, x_b2 | (−∞, x_b1) | [x_b2, ∞) | [x_b1, x_b2)
Rejection | x_c1, x_c2 | x_b1, x_b2, x_b3, x_b4 | (−∞, x_b1) and (x_b4, ∞) | [x_b2, x_b3] | [x_b1, x_b2) and (x_b3, x_b4]

In this work, we adopt a numerical approach to search for
x_b1 and x_b2. Once these values are known,
one gets the error rate and reject rate from:

    E = E_1 + E_2 = p(t_i = 1, y_j = 2) + p(t_i = 2, y_j = 1)
      = \frac{p(t_1)}{2}[1 - erf(X_{12})] + \frac{p(t_2)}{2}[1 - erf(X_{21})], \qquad (43a)

    Rej = Rej_1 + Rej_2 = p(t_i = 1, y_j = 3) + p(t_i = 2, y_j = 3)
        = \frac{p(t_1)}{2}[erf(X_{11}) + erf(X_{12})] + \frac{p(t_2)}{2}[erf(X_{21}) + erf(X_{22})]. \qquad (43b)
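
The closed forms of eqs. (42) and (43) can be evaluated directly with the standard error function. The sketch below is illustrative only: the function name `gaussian_joint`, the class parameters, and the two boundary points are hypothetical, and the sign convention X_i1 = (μ_i − x_b1)/(√2 σ_i), X_i2 = (x_b2 − μ_i)/(√2 σ_i) is the one under which every row of the matrix sums to p(t_i).

```python
import math

def gaussian_joint(mu, sigma, prior, x_b1, x_b2):
    """Joint matrix p(t_i, y_j) for R1 = (-inf, x_b1), R2 = [x_b2, inf), R3 = [x_b1, x_b2)."""
    joint = []
    for m, s, p in zip(mu, sigma, prior):
        big_x1 = (m - x_b1) / (math.sqrt(2.0) * s)   # X_i1
        big_x2 = (x_b2 - m) / (math.sqrt(2.0) * s)   # X_i2 (sign flipped for j = 2)
        joint.append([
            0.5 * p * (1.0 - math.erf(big_x1)),                # decided as Class 1
            0.5 * p * (1.0 - math.erf(big_x2)),                # decided as Class 2
            0.5 * p * (math.erf(big_x1) + math.erf(big_x2)),   # rejected
        ])
    return joint

joint = gaussian_joint(mu=(-1.0, 1.0), sigma=(1.0, 1.0), prior=(0.7, 0.3),
                       x_b1=-0.2, x_b2=0.9)
error = joint[0][1] + joint[1][0]    # E = E1 + E2
reject = joint[0][2] + joint[1][2]   # Rej = Rej1 + Rej2
print(joint, error, reject)
```

The resulting matrix can be passed to any routine that evaluates NI from a joint distribution, which is how a numerical search over x_b1 and x_b2 would score candidate boundaries.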



Case_3_MG: Rejection in two cross-over points.
This is the general case for mutual-information classifiers, in
which four boundary points, x_bj, are formed. When the
four points are obtained numerically by solving eq. (31), the
classification regions R_1 to R_3 are set as shown in Table II.
With the condition x_b1 < x_b2 < x_b3 < x_b4, the closed-form
solution of p(t, y) can be given in a way similar to eq. (42).
Additionally, both error and reject rates can be evaluated from
p(t, y). For comparison with Bayesian classifiers, the equivalent
rejection thresholds are derived from the given data of x_bj:

    T_{r1} = 1 - p(t_1|x = x_{b1})
           = 1 - \frac{p(t_1)\,\sigma_2\, e^{-\frac{(x_{b1} - \mu_1)^2}{2\sigma_1^2}}}
                      {p(t_1)\,\sigma_2\, e^{-\frac{(x_{b1} - \mu_1)^2}{2\sigma_1^2}} + p(t_2)\,\sigma_1\, e^{-\frac{(x_{b1} - \mu_2)^2}{2\sigma_2^2}}}, \qquad (44a)

    T_{r2} = 1 - p(t_2|x = x_{b2})
           = 1 - \frac{p(t_2)\,\sigma_1\, e^{-\frac{(x_{b2} - \mu_2)^2}{2\sigma_2^2}}}
                      {p(t_1)\,\sigma_2\, e^{-\frac{(x_{b2} - \mu_1)^2}{2\sigma_1^2}} + p(t_2)\,\sigma_1\, e^{-\frac{(x_{b2} - \mu_2)^2}{2\sigma_2^2}}}. \qquad (44b)

With the condition x_b1 < x_b2 < x_b3 < x_b4 shown in
Fig. 1d, substituting either x_b1 or x_b4 into (44) gives the
same value of T_r1, and likewise x_b2 or x_b3 for T_r2.
The results for T_r1 and T_r2 indicate that mutual-information
classifiers automatically search for the rejection thresholds
that balance the error rate and reject rate for the given data
of the classes. This specific feature will be discussed in Section
IV.
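
Eq. (44) amounts to reading the posterior curves at the boundary points. The following sketch shows that step; the function names and the two boundary values (−0.9 and −0.1) are hypothetical and chosen only for illustration, while the class parameters are those of Example 1 in Section IV.

```python
import math

def posterior_t1(x, mu1, sigma1, p1, mu2, sigma2, p2):
    """p(t1|x) for two univariate Gaussian classes, in the form used by eq. (44)."""
    w1 = p1 * sigma2 * math.exp(-(x - mu1) ** 2 / (2.0 * sigma1**2))
    w2 = p2 * sigma1 * math.exp(-(x - mu2) ** 2 / (2.0 * sigma2**2))
    return w1 / (w1 + w2)

def equivalent_thresholds(x_b1, x_b2, **params):
    """T_r1 = 1 - p(t1|x_b1) and T_r2 = 1 - p(t2|x_b2)."""
    t_r1 = 1.0 - posterior_t1(x_b1, **params)
    # 1 - p(t2|x_b2) equals p(t1|x_b2) because the two posteriors sum to one.
    t_r2 = posterior_t1(x_b2, **params)
    return t_r1, t_r2

example = dict(mu1=-1.0, sigma1=2.0, p1=0.5, mu2=1.0, sigma2=1.0, p2=0.5)
t_r1, t_r2 = equivalent_thresholds(-0.9, -0.1, **example)  # hypothetical boundaries
print(t_r1, t_r2, t_r1 + t_r2)
```

For these illustrative boundaries the two thresholds sum to less than one, consistent with the constraint on T_r1 + T_r2 used for Bayesian classifiers with rejection.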



C. Examples of Mutual-information Classifiers on Univariate 
Uniform Distributions 

For comparison with Bayesian classifiers, we examine
mutual-information classifiers on uniform distributions in this
section. The two classes and their conditional probability
density functions are given in (22). Three cases are
discussed below.

Case_1_MU: Partially overlapping between two distributions.

In this case (Fig. 2a), one first needs to construct the joint
distribution p(t, y). For binary classifiers, p(t, y) takes the
following forms:

    p(t, y) = \begin{bmatrix}
        p(t_1) & 0 & 0 \\
        \dfrac{p(t_2)(x_2 - x_3)}{x_4 - x_3} & \dfrac{p(t_2)(x_4 - x_2)}{x_4 - x_3} & 0
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_1, \qquad (45a)

    p(t, y) = \begin{bmatrix}
        \dfrac{p(t_1)(x_3 - x_1)}{x_2 - x_1} & \dfrac{p(t_1)(x_2 - x_3)}{x_2 - x_1} & 0 \\
        0 & p(t_2) & 0
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_2, \qquad (45b)

    p(t, y) = \begin{bmatrix}
        \dfrac{p(t_1)(x_3 - x_1)}{x_2 - x_1} & 0 & \dfrac{p(t_1)(x_2 - x_3)}{x_2 - x_1} \\
        0 & \dfrac{p(t_2)(x_4 - x_2)}{x_4 - x_3} & \dfrac{p(t_2)(x_2 - x_3)}{x_4 - x_3}
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_3. \qquad (45c)

Eq. (45) gives three sets of p(t, y) because different
decisions may be made for R_i in Fig. 2a. Substituting
(45) into (32), one obtains three sets of NI values. Closed-form
solutions for the decision can be given, but this work
adopts a numerical approach to avoid tedious descriptions
of the formulas.

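Case_2_MU: Fully overlapping by one class.
The formula for p(t, y) in this case (Fig. 2b) is:

    p(t, y) = \begin{bmatrix}
        p(t_1) & 0 & 0 \\
        p(t_2) & 0 & 0
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_1, \qquad (46a)

    p(t, y) = \begin{bmatrix}
        \dfrac{p(t_1)(x_2 - x_1 - x_4 + x_3)}{x_2 - x_1} & \dfrac{p(t_1)(x_4 - x_3)}{x_2 - x_1} & 0 \\
        0 & p(t_2) & 0
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_2, \qquad (46b)

    p(t, y) = \begin{bmatrix}
        \dfrac{p(t_1)(x_2 - x_1 - x_4 + x_3)}{x_2 - x_1} & 0 & \dfrac{p(t_1)(x_4 - x_3)}{x_2 - x_1} \\
        0 & 0 & p(t_2)
    \end{bmatrix}, \quad \text{if } f(x \in R_i) = y_3. \qquad (46c)

One gets the following results by substituting (46)
into (32):

    NI(T, Y) = 0, \quad \text{if } f(x \in R_i) = y_1, \qquad (47a)

    0 < NI(T, Y) < 1, \quad \text{if } f(x \in R_i) = y_2 \text{ (or } y_3\text{)}. \qquad (47b)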



Eq. (47a) suggests that the decision f(x ∈ R_i) = y_1
produces zero information. Therefore, mutual-information
classifiers never make this kind of decision (but Bayesian
classifiers may do so).
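
The statement of eq. (47) can be reproduced numerically. The sketch below builds the three joint matrices of eq. (46) and evaluates NI for each decision. The helper names `ni` and `joint_fully_overlapping`, the end points, and the priors are hypothetical, and base-2 logarithms are assumed.

```python
import math

def ni(joint):
    """NI(T, Y) = I(T, Y) / H(T) for a joint matrix with rows t_i and columns y_j."""
    p_t = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_t = -sum(p * math.log2(p) for p in p_t if p > 0.0)
    i_ty = sum(p * math.log2(p / (p_t[i] * p_y[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0.0)
    return i_ty / h_t

def joint_fully_overlapping(x1, x2, x3, x4, p1, p2, decision):
    """Joint matrix when the overlap [x3, x4] is decided as class 1, 2, or reject (3)."""
    inside = p1 * (x4 - x3) / (x2 - x1)  # Class 1 mass inside the overlap
    outside = p1 - inside                # Class 1 mass outside it, always decided as Class 1
    row1 = [outside, 0.0, 0.0]
    row2 = [0.0, 0.0, 0.0]
    row1[decision - 1] += inside
    row2[decision - 1] += p2
    return [row1, row2]

ni_values = {d: ni(joint_fully_overlapping(0.0, 4.0, 1.0, 2.0, 0.8, 0.2, d))
             for d in (1, 2, 3)}
print(ni_values)
```

Decisions y_2 and y_3 give the same NI here because the two matrices differ only by a relabeling of columns. Mutual information alone therefore does not distinguish misclassifying from rejecting the overlapped class in this configuration.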

Case_3_MU: Separation between two distributions.
Mutual-information classifiers give the same perfect solutions
as Bayesian classifiers.

IV. Comparisons between Bayesian Classifiers and 
Mutual-information Classifiers 

A. General Comparisons 

Mutual-information classifiers provide users with a wider
perspective on classification problems, and hence a larger
toolbox for their applications. To discover new features of
this approach, the present section discusses general aspects
of mutual-information and Bayesian classifiers side by side
for a systematic comparison. The main objective of the
comparative study is to reveal their respective advantages
and disadvantages. Their associated issues, or new
challenges, are also presented from the personal viewpoint of
the author.
First, both types of classifiers share the same assumptions
of requiring exact knowledge of the class distributions and
of specifying the status of the reject option (Table III). The "exact
knowledge" requirement is the greatest weakness of the two
approaches in applications. In other words, the approaches are
more theoretically meaningful than directly useful for
solving real-world problems. When exact knowledge is
not available, the existing approaches to estimating class
distributions [4][33][40] for Bayesian classifiers are also feasible
for implementing mutual-information classifiers. The learning
targets of Bayesian classifiers involve evaluations of risks or
errors, which are largely compatible with classification goals in
real-life applications. However, mutual information,
or entropy-based criteria in general, is not a common concern or
requirement of most classifier designers and users [11].

Second, Bayesian classifiers require (or implicitly apply)
cost terms in their designs. This requirement brings both
advantages and disadvantages, depending on the application. The
main advantage is flexibility in offering objective or subjective
designs of classifiers. When exact knowledge of the cost terms is
available and reliable, inputting such data is very simple
and meaningful for realizing objective designs. At the same
time, subjective designs are always possible. The main
disadvantage arises for objective designs if one has
incomplete information about the cost terms. Generally, cost terms
are more liable to subjectivity than prior probabilities. In this
case, avoiding the introduction of subjectivity is not an easy
task for Bayesian classifiers. Mutual-information classifiers,
which do not require cost terms, are an objective approach.
They carry the intrinsic feature of "letting the data
speak for itself", which is a significant difference from
a subjective version of Bayesian classifiers. However, the
current definition of mutual-information classifiers needs to
be extended to carry the flexibility of subjective designs,
which is technically feasible by introducing free parameters,
such as fuzzy entropy [42].

Third, one of the problems with the current learning targets
of Bayesian classifiers is their failure to obtain the optimal
rejection threshold in classification. Chow [3] and
Ha [22] suggested formulas in the forms of

    \min Risk(T_r) = E(T_r) + T_r\,Rej(T_r), \qquad (48)

or

    \min Risk(T_r) = \frac{E(T_r)}{Rej(T_r)}, \qquad (49)

respectively, but a minimization of either formula leads
to the solution T_r = 0 with Risk = 0, which implies the
rejection of all patterns. Therefore, one can expect that a
meaningful learning target, applicable to Bayesian
classifiers for determining optimal rejection thresholds, is yet
to be established. On the
contrary, mutual-information classifiers are able to achieve
optimal rejection thresholds as the classifiers' outcomes. The
remaining issue is to study the optimal cases in a systematic
way.

Fourth, Bayesian classifiers generally fail to handle
class-imbalanced data properly if no cost terms are specified,
as described in Theorem 3. When one class has
a much smaller (or nearly zero) population and no distinction is
made among error types, Bayesian classifiers tend
to misclassify all patterns of the smaller class, and their NI will
be approximately zero, which means that no information
is obtained from the classifiers [9]. Mutual-information classifiers
display particular advantages in these situations, including
cases of abstaining classification. They provide a solution that
balances error types and reject types without using cost terms.
The challenge lies in the theoretical derivation of their response
behaviors, such as the upper and lower bounds of E_i/p(t_i)
for mutual-information classifiers.

Fifth, mutual-information classifiers add extra computational
complexity and cost over Bayesian classifiers.
Both types of classifiers require computation of the posterior
probabilities. Once these are obtained, Bayesian classifiers
produce decision results directly. Mutual-information
classifiers, however, need further procedures, such as
forming a confusion matrix (or a joint distribution matrix),
evaluating NI in (31), and searching for boundary points in
the non-convex space of NI in (40). These procedures introduce
significant analytical and computational difficulties
for mutual-information classifiers, particularly in multiple-class
problems with high dimensions.

Note that the discussions above provide a preliminary
answer to the question posed in the title of this paper. In
another connection, Appendix B presents tighter bounds
between conditional entropy and Bayesian error in binary
classification. Further investigations are expected to find other
differences under various assumptions or backgrounds, such as
distributions of mixture models, multiple-class classification
with high-dimensional variables, rejection of a subset of classes
[22], and experimental studies on real-world datasets.

TABLE III
Data Information for Bayesian and Mutual-information Classifiers in Binary Classifications

Classifier type | Required input: on data | Required input: on rejection | Learning target | Output data
Bayesian | p(t_1), p(t_2); p(x|t_1), p(x|t_2); λ_11, λ_12, λ_13, λ_21, λ_22, λ_23 (or T_r1 and T_r2) | No or Yes | min Risk(y) or min E(y) | E_1, E_2, Rej_1, Rej_2, Risk, R_1, R_2, R_3, T_r1, T_r2, ({λ_21/λ_12}, or {λ_21, λ_13, λ_23})
Mutual-information | p(t_1), p(t_2); p(x|t_1), p(x|t_2) | No or Yes | max NI(T, Y) | E_1, E_2, Rej_1, Rej_2, NI, R_1, R_2, R_3, T_r1, T_r2, ({λ_21/λ_12}, or {λ_21, λ_13, λ_23})


B. Comparisons on Univariate Gaussian Distributions 

Gaussian distributions are important not only in a theoretical
sense. To a large extent, the Gaussian assumption is also appropriate
for providing critical guidelines in real applications. For
classification problems, many important findings can be revealed
from a study of Gaussian distributions.

The following numerical examples are specifically designed
to demonstrate the intrinsic differences between Bayesian
and mutual-information classifiers on Gaussian distributions.
For the calculation of NI values in the following examples,
an open-source toolkit [39] is adopted for the computations of
mutual-information classifiers.

Example 1: Two cross-over points. The data for no rejection
are given below:

No rejection:

    μ_1 = −1, σ_1 = 2, p(t_1) = 0.5, λ_11 = 0, λ_12 = 1,
    μ_2 = 1, σ_2 = 1, p(t_2) = 0.5, λ_21 = 1, λ_22 = 0.
The cost terms are used for Bayesian classifiers, but not
for mutual-information classifiers. Table IV lists the results
for both classifiers. One can obtain the same results when
inputting $\lambda_{13} = 1 - \lambda_{23}$ for Bayesian classifiers. This is
why a two-by-two matrix has to be used in the case of no
rejection. Two cross-over points are formed in this example
(Fig. 1b). If no rejection is selected, both classifiers will have
two boundary points. Bayesian classifiers partition the
classification regions at $x_{b1} = x_{c1} = -0.238$ and
$x_{b2} = x_{c2} = 3.57$. Mutual-information classifiers widen the
region $R_2$ to $x_{b1} = -0.674$ and $x_{b2} = 4.007$, so that the error
for Class 2 is much reduced. Considering zero costs for
correct classifications and using the decision rule with $\delta_1 = \lambda_{21}/\lambda_{12}$,
one can calculate the cost ratio below, which is the single independent
parameter of Bayesian classifiers in the case of no rejection:

$$\frac{\lambda_{21}}{\lambda_{12}} = \frac{p(x = x_b \mid t_1)\, p(t_1)}{p(x = x_b \mid t_2)\, p(t_2)}, \tag{50}$$

which is used to establish an equivalence between mutual-information
classifiers and Bayesian classifiers. Substituting
the boundary points of the mutual-information classifier,
$x_{b1} = -0.674$ and $x_{b2} = 4.007$, into $p(x \mid t_i)$ and (50), respectively,
one receives a unique cost ratio value, $\lambda_{21}/\lambda_{12} = 2.002$. Hence,
this mutual-information classifier has a unique equivalence to
a specific Bayesian classifier, which is specified by the following
conditions on the cost terms:

$\lambda_{11} = 0,\ \lambda_{12} = 1.0,\ \lambda_{21} = 2.002,\ \lambda_{22} = 0.$
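The boundary points and the cost ratio of Example 1 can be checked numerically. The sketch below is not the toolkit of [39]; it is a minimal illustration that assumes the Gaussian parameters stated for Example 1 and solves the log form of eq. (50) as a quadratic in x. The function names `weighted_ratio` and `boundaries` are hypothetical.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Example 1 parameters (no rejection), taken from the text.
MU1, S1, P1 = -1.0, 2.0, 0.5
MU2, S2, P2 = 1.0, 1.0, 0.5

def weighted_ratio(x):
    """Right-hand side of eq. (50): p(x|t1)p(t1) / (p(x|t2)p(t2))."""
    return gauss_pdf(x, MU1, S1) * P1 / (gauss_pdf(x, MU2, S2) * P2)

def boundaries(cost_ratio):
    """Solve weighted_ratio(x) = cost_ratio; taking logs gives a*x^2 + b*x + c = 0."""
    a = 0.5 / S2**2 - 0.5 / S1**2
    b = MU1 / S1**2 - MU2 / S2**2
    c = (0.5 * MU2**2 / S2**2 - 0.5 * MU1**2 / S1**2
         + math.log(P1 * S2 / (P2 * S1)) - math.log(cost_ratio))
    root = math.sqrt(b * b - 4.0 * a * c)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

# Zero-one costs give lambda21/lambda12 = 1 (Bayesian classifier).
bayes_xb = boundaries(1.0)
# Cost ratio implied by the mutual-information boundary points of Table IV.
mi_ratio = [weighted_ratio(x) for x in (-0.674, 4.007)]
print(bayes_xb, mi_ratio)
```

Both mutual-information boundary points should map to nearly the same ratio, which is what makes the equivalent Bayesian classifier unique in the no-rejection case.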

Following a similar analysis, one can reach a
consistent observation from a parametric study on
$\sigma_1/\sigma_2$ in binary classifications. When two classes are well
balanced, that is, $p(t_1) = p(t_2)$, both types of classifiers
will produce larger errors in association with the larger-variance
class. However, mutual-information classifiers always
add more cost weight on the misclassification from a smaller-variance
class. In other words, mutual-information classifiers
prefer to generate a smaller error on a smaller-variance class
in comparison with Bayesian classifiers when using zero-one
cost functions (Table IV). This behavior
seems closer to our intuitions in binary classifications under
the condition of a balanced class dataset. When the two classes
differ significantly in their variances, the
smaller-variance class generally represents a signal of interest
embedded within noise, which often has a larger variance. The
common practice in such classification scenarios is to assign a
larger cost weight to the misclassification from the smaller-variance
class, and vice versa for the larger-variance class.

If a reject option is enforced for the following data:

Rejection:

$\mu_1 = -1,\ \sigma_1 = 2,\ p(t_1) = 0.5,\ \lambda_{11} = 0,\ \lambda_{12} = 1.2,\ \lambda_{13} = 0.2,$
$\mu_2 = 1,\ \sigma_2 = 1,\ p(t_2) = 0.5,\ \lambda_{21} = 1,\ \lambda_{22} = 0,\ \lambda_{23} = 0.6,$

four boundary points are required to determine classification
regions, as shown in Fig. 1d. For the given cost terms, the
Bayesian classifier shows a lower error rate and a lower reject
rate than the mutual-information classifier. While the rejects are
almost equal between the two classes, the errors are significantly
different. One is able to adjust the errors and rejects by changing
cost terms. For mutual-information classifiers, however, a balance
is automatically made among error types and reject types. The results,
shown in Table IV, can be regarded as objective
evaluations since no cost terms are specified subjectively.

TABLE IV

Results of Example 1 on Univariate Gaussian Distributions

| Reject option | Classifier type | E1 | E2 | E | Rej1 | Rej2 | Rej | Tr1 | Tr2 | Boundary points x_b1, ..., x_b4 | NI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No rejection | Bayesian | 0.170 | 0.057 | 0.227 | - | - | - | - | - | -0.238, 3.571 | 0.245 |
| No rejection | Mutual-information | 0.215 | 0.024 | 0.239 | - | - | - | - | - | -0.674, 4.007 | 0.260 |
| Rejection | Bayesian | 0.131 | 0.024 | 0.155 | 0.083 | 0.084 | 0.167 | 0.333 | 0.375 | -0.673, 0.162, 3.171, 4.006 | 0.285 |
| Rejection | Mutual-information | 0.154 | 0.006 | 0.160 | 0.118 | 0.068 | 0.186 | 0.141 | 0.445 | -1.24, -0.0762, 3.409, 4.571 | 0.297 |

Note that a reject option enables both classifiers to reach
higher NI values than those in the case without
rejection. Because no "one-to-one" relations exist between the
thresholds and the cost terms in a rejection case, one will fail
to acquire a unique set of equivalent cost terms between
the Bayesian classifier and the mutual-information classifier.
For example, the two sets of cost terms below will produce the
same Bayesian classifier based on the given solutions of the
mutual-information classifier:

$\lambda_{11} = 0,\ \lambda_{12} = 1,\ \lambda_{13} = 0.0376,$
$\lambda_{21} = 1,\ \lambda_{22} = 0,\ \lambda_{23} = 0.772$

or

$\lambda_{11} = 0,\ \lambda_{12} = 2.247,\ \lambda_{13} = 1,$
$\lambda_{21} = 7.069,\ \lambda_{22} = 0,\ \lambda_{23} = 1.$

The meanings of the two sets of cost terms are different. The
first set indicates the same costs for errors, but the second one
suggests the same costs for rejects. The results above imply
an intrinsic problem of "non-consistency" for interpreting cost
terms. One needs to be cautious about this problem when
setting cost terms for Bayesian classifiers. This phenomenon
occurs only when a reject option is considered, and
not in the case without rejection. If knowledge about the
thresholds exists, it is better for abstaining classifiers to apply $T_{rk}$
directly as the input data (Table III), instead of employing
cost terms. If no information is given about the thresholds or
cost terms, mutual-information classifiers are able to provide
an objective, or initial, reference of $T_{rk}$ for Bayesian classifiers
in cost-sensitive learning.
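The redundancy of cost terms can be illustrated with a short calculation. The sketch below assumes that the rejection thresholds follow from the likelihood-ratio bounds of Appendix A as $T_{r1} = (\lambda_{13}-\lambda_{11})/(\lambda_{13}-\lambda_{11}+\lambda_{21}-\lambda_{23})$ and $T_{r2} = (\lambda_{23}-\lambda_{22})/(\lambda_{23}-\lambda_{22}+\lambda_{12}-\lambda_{13})$; this mapping is an assumption of the sketch, chosen because it reproduces the $T_{r1}$ and $T_{r2}$ values of Table IV. The function name `reject_thresholds` is hypothetical.

```python
def reject_thresholds(l11, l12, l13, l21, l22, l23):
    """Map six cost terms to the two rejection thresholds (assumed mapping)."""
    tr1 = (l13 - l11) / (l13 - l11 + l21 - l23)
    tr2 = (l23 - l22) / (l23 - l22 + l12 - l13)
    return tr1, tr2

# Cost terms given for the Bayesian classifier with rejection in Example 1.
given = reject_thresholds(0.0, 1.2, 0.2, 1.0, 0.0, 0.6)

# Two different cost sets that are both equivalent to the
# mutual-information classifier with rejection.
same_error_costs = reject_thresholds(0.0, 1.0, 0.0376, 1.0, 0.0, 0.772)
same_reject_costs = reject_thresholds(0.0, 2.247, 1.0, 7.069, 0.0, 1.0)

print(given, same_error_costs, same_reject_costs)
```

Six cost terms collapse into two thresholds, so many cost sets yield the same classifier, which is the "non-consistency" problem described above.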

Example 2: One cross-over point. The given inputs in this
example are:

No rejection:

$\mu_1 = -1,\ \sigma_1 = 1,\ \lambda_{11} = 0,\ \lambda_{12} = 1,$
$\mu_2 = 1,\ \sigma_2 = 1,\ \lambda_{21} = 1,\ \lambda_{22} = 0,$
$p(t_1) = 0.5,\ 2/3,\ 0.8,\ 0.9,\ 0.99,\ 0.999,\ 0.9999,$
$p(t_2) = 0.5,\ 1/3,\ 0.2,\ 0.1,\ 0.01,\ 0.001,\ 0.0001.$

Specific attention is paid to class-imbalanced data. When
Class 2 changes from "balanced" through "minority" to "rare" status
in the whole data, we need to find out what behaviors both
types of classifiers will display. For this purpose, a natural
scheme with zero-one cost terms is set for Bayesian classifiers.
Numerical investigations are conducted in this example. Table
V lists the results of the classifiers on the given data. If one follows
the conventional term FNR for "false negative rate" in binary
classifications, which is defined as

$$\mathrm{FNR} = \frac{E_2}{p(t_2)}, \tag{51}$$




one can examine the behavior of FNR with respect to the ratio
$p(t_1)/p(t_2)$. Sometimes, FNR is also called a "miss rate"
[4]. The two types of classifiers show the same results when the two
classes are exactly balanced, that is, $p(t_1)/p(t_2) = 1$. A single
boundary point (Fig. 1a) separates the two classes at the exact
cross-over point ($x_b = x_c = 0$). When the prior of one class, say $p(t_2)$
for Class 2, becomes smaller, the boundary point of the Bayesian
classifier moves toward the mean point ($\mu_2 = 1$) of Class
2 (as pointed out in [4, page 39]), and finally passes it. To
keep the smallest error, a Bayesian classifier will sacrifice
the minority class. The results in Table V confirm Theorem
3 numerically for Bayesian classifiers. Fig. 3 shows such
behavior in the plot of "$E_2/p(t_2)$ vs. $p(t_1)/p(t_2)$". Note
that the plots for the range from $10^{-4}$ to $10^{0}$ on the $p(t_1)/p(t_2)$
axis are also depicted based on the data in Table V. For
example, at the data point of $p(t_1)/p(t_2) = 1/2$, one can get
$E_2/p(t_2) = 0.0594/(2/3)$, where 0.0594 is taken from $E_1$
for the data at $p(t_1)/p(t_2) = 2$. The response of $E_2/p(t_2)$,
representing the false negative rate, shows a distinctive
property of Bayesian classifiers. One can observe that the
complete set of Class 2 could be misclassified when it becomes
extremely rare. This finding explains another reason for the
question: "Why do classifiers perform worse on the minority
class?" in ifBll . 
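The Bayesian rows of Table V can be reproduced in closed form. The sketch below assumes the Example 2 parameters with zero-one costs, for which the boundary is the point where $p(t_1)p(x|t_1) = p(t_2)p(x|t_2)$; for equal variances this reduces to $x_b = (\mu_1+\mu_2)/2 + \sigma^2 \ln[p(t_1)/p(t_2)]/(\mu_2-\mu_1)$. The function names are hypothetical.

```python
import math

def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

MU1, MU2, SIGMA = -1.0, 1.0, 1.0   # Example 2 parameters

def bayes_boundary(ratio):
    """Cross-over point for the prior ratio p(t1)/p(t2) under zero-one costs."""
    return 0.5 * (MU1 + MU2) + SIGMA**2 * math.log(ratio) / (MU2 - MU1)

def bayes_fnr(ratio):
    """FNR = E2/p(t2): the share of Class 2 that falls below the boundary."""
    return phi_cdf((bayes_boundary(ratio) - MU2) / SIGMA)

for ratio in (1, 2, 4, 9, 99, 999, 9999):
    print(ratio, round(bayes_boundary(ratio), 3), round(bayes_fnr(ratio), 3))
```

Because the boundary grows like the logarithm of the prior ratio, it passes $\mu_2 = 1$ once the ratio exceeds about 7.4, and the FNR then tends to one.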

TABLE V

Results of Example 2 on Univariate Gaussian Distributions

| Classifier type | Quantity | p(t1)/p(t2) = 1 | 2 | 4 | 9 | 99 | 999 | 9999 |
|---|---|---|---|---|---|---|---|---|
| | [p(t1), p(t2)] | [0.5, 0.5] | [2/3, 1/3] | [0.8, 0.2] | [0.9, 0.1] | [0.99, 0.01] | [0.999, 0.001] | [0.9999, 0.0001] |
| Bayesian | E1 | 0.0793 | 0.0594 | 0.0362 | 0.0161 | 0.483e-3 | 0.422e-5 | 0.000 |
| Bayesian | E2 | 0.0793 | 0.0856 | 0.0759 | 0.0539 | 0.903e-2 | 0.993e-3 | 0.1e-3 |
| Bayesian | E2/p(t2) | 0.159 | 0.257 | 0.379 | 0.539 | 0.903 | 0.993 | 1.000 |
| Bayesian | x_b (= x_c) | 0.0 | 0.347 | 0.693 | 1.10 | 2.30 | 3.45 | 4.61 |
| Bayesian | H(T\|Y) | 0.631 | 0.591 | 0.491 | 0.349 | 0.0756 | 0.0113 | 0.00147 |
| Bayesian | NI | 0.369 | 0.356 | 0.320 | 0.256 | 0.0644 | 0.00524 | 0.124e-3 |
| Mutual-information | E1 | 0.0793 | 0.0867 | 0.0852 | 0.0772 | 0.0585 | 0.0551 | 0.0547 |
| Mutual-information | E2 | 0.0793 | 0.0637 | 0.0451 | 0.0264 | 0.331e-2 | 0.343e-3 | 0.345e-4 |
| Mutual-information | E2/p(t2) | 0.159 | 0.191 | 0.225 | 0.264 | 0.331 | 0.343 | 0.345 |
| Mutual-information | x_b | 0.0 | 0.126 | 0.246 | 0.367 | 0.562 | 0.597 | 0.601 |
| Mutual-information | H(T\|Y) | 0.631 | 0.586 | 0.472 | 0.320 | 0.0629 | 0.00957 | 0.00129 |
| Mutual-information | NI | 0.369 | 0.362 | 0.346 | 0.317 | 0.222 | 0.161 | 0.125 |

Fig. 3. Curves of "$E_2/p(t_2)$ vs. $p(t_1)/p(t_2)$". Solid curve: Bayesian
classifier. Dashed curve: Mutual-information classifier.

Mutual-information classifiers exhibit different behavior on
the given dataset. The first important feature is that the
boundary point shifts toward the mean point ($\mu_2 = 1$)
of Class 2 but never goes over it. The second feature
is that the response of $E_2/p(t_2)$ asymptotically approaches
a stable value, about 0.345 in this example, for a
large ratio of $p(t_1)/p(t_2)$. This feature indicates that
mutual-information classifiers will never sacrifice a minority class
completely in this specific example. A significant fraction of
the rare class is identified correctly. Moreover, the curve of
$E_2/p(t_2)$ also demonstrates a lower, yet non-zero, bound on the
error rate (about 0.054) when $p(t_1)/p(t_2)$ approaches zero.
This phenomenon implies that, for Gaussian distributions of
classes, mutual-information classifiers generally do not tend
to sacrifice a complete class in classifications.
However, from a theoretical viewpoint, we still need to
establish an analytical derivation of lower and upper bounds of
$E_i/p(t_i)$ for mutual-information classifiers.
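The mutual-information boundary points of Example 2 can be approximated by maximizing NI over a single boundary $x_b$. The sketch below is not the toolkit of [39]; it assumes NI = I(T;Y)/H(T), a decision of Class 1 for $x < x_b$, and a plain grid search between the two class means. The names `normalized_mi` and `mi_boundary` are hypothetical.

```python
import math

def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def normalized_mi(xb, p1, mu1=-1.0, mu2=1.0, sigma=1.0):
    """NI = I(T;Y)/H(T) when Class 1 is decided for x < xb (no rejection)."""
    p2 = 1.0 - p1
    f1 = phi_cdf((xb - mu1) / sigma)   # P(x < xb | t1)
    f2 = phi_cdf((xb - mu2) / sigma)   # P(x < xb | t2)
    # Joint distribution ordered as (t1,y1), (t1,y2), (t2,y1), (t2,y2).
    joint = [p1 * f1, p1 * (1.0 - f1), p2 * f2, p2 * (1.0 - f2)]
    h_t = entropy([p1, p2])
    h_y = entropy([joint[0] + joint[2], joint[1] + joint[3]])
    return (h_t + h_y - entropy(joint)) / h_t

def mi_boundary(p1, lo=-1.0, hi=1.0, steps=2000):
    """Grid search for the boundary that maximizes NI (illustrative only)."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda xb: normalized_mi(xb, p1))

for p1 in (0.5, 2.0 / 3.0, 0.9):
    xb = mi_boundary(p1)
    print(p1, round(xb, 3), round(normalized_mi(xb, p1), 3))
```

A grid search is used only to keep the sketch dependency-free; any one-dimensional optimizer could be substituted.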

Example 3: Zero cross-over points. The given data for the two
classes are:

$\mu_1 = 0,\ \sigma_1 = 2,\ p(t_1) = 0.8,$
$\mu_2 = 0,\ \sigma_2 = 1,\ p(t_2) = 0.2.$

Although no data are specified for the cost terms, a zero-one
loss function is generally implied for them [4]. From the
cross-over point condition, one can see that a case of zero
cross-over points occurs in this
example (Fig. 4c). For the zero-one setting of cost terms, the
Bayesian classifier produces the specific classification result
of "Majority-taking-all", that is, all patterns are identified as
Class 1. The error falls on Class 2 only, and the
relation NI = 0 holds, which indicates that no information is
obtained from the classifier [9]. One can imagine that the given
example describes a classification problem where a target
class, with a Gaussian distribution, is corrupted by wider-band
Gaussian noise in a frequency domain (Fig. 4a). The plots
of $p(t_i)p(x|t_i)$ show the overwhelming distribution of Class
1 over that of Class 2 (Fig. 4b). The plots of the posterior
probability $p(t_i|x)$ indicate that Class 2 has no chance to be
considered in the complete domain of $x$ (Fig. 4c).

Table VI lists the results for both types of classifiers. The
Bayesian approach fails to achieve meaningful results on
the given data. When the input data of $\lambda_{13}$ and $\lambda_{23}$ are missing,
one cannot carry out the Bayesian approach for abstaining
classifications. On the contrary, without specifying any cost
term, mutual-information classifiers are able to detect the
target class with a reasonable degree of accuracy. When no
rejection is selected, an error of less than two percent
($E_2 = 1.53\%$) falls on the target class. Although the total error
($E = 51.4\%$) is much higher than its Bayesian counterpart
($E = 20\%$, $FNR = 100\%$), the miss rate of about eight percent
($FNR = 7.65\%$) for the target is
really meaningful in applications. If a reject option is engaged,
the miss rate is further reduced to $FNR = 4.10\%$, at the
price of a reject rate of 29.1% over the total possible
patterns. This example confirms again the unique feature of
mutual-information classifiers. The results of $T_{rk}$ from
mutual-information classifiers can also serve as a useful reference for the
design of Chow's abstaining classifiers, either with or without
knowledge about cost terms.
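The figures quoted for Example 3 can be verified with a few lines. The sketch below assumes the Example 3 parameters and, for the mutual-information classifier without rejection, the boundary points $\pm 1.77$ listed in Table VI, with Class 2 decided between them. The function names are hypothetical.

```python
import math

def phi_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

P1, S1 = 0.8, 2.0   # Class 1: zero mean, wider noise-like class
P2, S2 = 0.2, 1.0   # Class 2: zero mean, narrower target class

def posterior_t2(x):
    """p(t2|x); the common factor 1/sqrt(2*pi) cancels in the ratio."""
    w1 = P1 * math.exp(-0.5 * (x / S1) ** 2) / S1
    w2 = P2 * math.exp(-0.5 * (x / S2) ** 2) / S2
    return w2 / (w1 + w2)

def errors_no_rejection(xb):
    """E1 and E2 when Class 2 is decided on |x| < xb and Class 1 elsewhere."""
    e1 = P1 * (2.0 * phi_cdf(xb / S1) - 1.0)
    e2 = P2 * 2.0 * (1.0 - phi_cdf(xb / S2))
    return e1, e2

# The posterior of Class 2 peaks at x = 0 and stays below 0.5, so a
# zero-one Bayesian classifier never decides Class 2 (zero cross-over points).
peak = max(posterior_t2(-5.0 + 0.01 * k) for k in range(1001))

e1, e2 = errors_no_rejection(1.77)
print(peak, e1, e2, e2 / P2)
```

The last printed value is the miss rate FNR = E2/p(t2) of the mutual-information classifier without rejection.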

C. Comparisons on Univariate Uniform Distributions 

Uniform distributions are very rare in classification problems.
This section shows one example taken from [3]. A
specific effort is made on numerical comparisons between the
two types of classifiers.

Example 4: Partial overlap between two distributions.
The task in this example is to set the cost terms for
controlling the decision results on the overlapping region for
the given data from [3]:

$$p(x|t_1) = \begin{cases} 1 & \text{when } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

$$p(x|t_2) = \begin{cases} 1/2 & \text{when } 0.5 \le x \le 2.5 \\ 0 & \text{otherwise} \end{cases}$$

$$p(t_1) = p(t_2) = 0.5.$$
Fig. 4. Plots for Example 3, where (b)-(c) describe a signal (blue curve) embedded in wider-band noise (black curve). (a) Plots of $p(x|t_i)$. (b) Plots of $p(t_i)p(x|t_i)$. (c) Plots of $p(t_i|x)$.

TABLE VI

Results of Example 3 on Univariate Gaussian Distributions

| Reject option | Classifier type | E1 | E2 | E | Rej1 | Rej2 | Rej | Tr1 | Tr2 | Boundary points | NI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No rejection | Bayesian | 0.0 | 0.2 | 0.2 | - | - | - | - | - | - | 0.0 |
| No rejection | Mutual-information | 0.499 | 0.0153 | 0.514 | - | - | - | - | - | -1.77, 1.77 | 0.0803 |
| Rejection | Mutual-information | 0.316 | 0.00819 | 0.324 | 0.239 | 0.0520 | 0.291 | 0.0945 | 0.749 | -2.04, -1.03, 1.03, 2.04 | 0.0926 |


In uniform distributions, a single independent parameter is
sufficient for classifications. Table VII lists the different
results with respect to $T_r$. Note that the present results
extend Chow's abstaining classifiers by adding one more
decision case, $f(x \in R_i) = y_2$, to those in [3]. The
extension is attributed to the three decision rules used in the present work, rather
than the two in Chow's classifiers, which demonstrates a more
general solution for classifications. One can see that
mutual-information classifiers will decide $f(x \in R_i) = y_3$ from the
given data of class distributions since this decision receives the maximum
value of NI. If no rejection is enforced, mutual-information
classifiers will choose $f(x \in R_i) = y_1$ for their solution.
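The NI values for the three possible decisions on the overlapping region can be computed directly from the joint distribution p(t, y). The sketch below assumes NI = I(T;Y)/H(T), with rejects treated as a third output $y_3$, and uses the probability masses implied by the Example 4 densities: the overlap $0.5 \le x \le 1$ carries 0.25 from Class 1 and 0.125 from Class 2. The variable and function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def normalized_mi(joint):
    """NI = I(T;Y)/H(T) for joint[i][j] = p(t_i, y_j); column 3 holds rejects."""
    p_t = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    h_t = entropy(p_t)
    mi = h_t + entropy(p_y) - entropy([p for row in joint for p in row])
    return mi / h_t

# Rows: true classes t1, t2. Columns: decisions y1, y2, y3 (reject).
decide_y1 = [[0.50, 0.00, 0.00], [0.125, 0.375, 0.000]]   # overlap -> Class 1
decide_y2 = [[0.25, 0.25, 0.00], [0.000, 0.500, 0.000]]   # overlap -> Class 2
decide_y3 = [[0.25, 0.00, 0.25], [0.000, 0.375, 0.125]]   # overlap -> reject

ni_values = [normalized_mi(j) for j in (decide_y1, decide_y2, decide_y3)]
print([round(v, 3) for v in ni_values])
```

Rejecting the overlap gives the largest NI of the three, and deciding $y_1$ is the best choice among the two no-rejection decisions, in line with the discussion above.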

V. Conclusions 

This work explored the differences between Bayesian classifiers
and mutual-information classifiers. Based on Chow's pioneering
work [2][3], the author revisited Bayesian classifiers
in two general scenarios because of their increasing
popularity in classifications. The first was on zero-one cost
functions for classifications without rejection. The second was
on the cost distinctions among error types and reject types
for abstaining classifications. In addition, the paper focused
on the analytical study of mutual-information classifiers in
comparison with Bayesian classifiers, which provided a basis
for the novel design or analysis of classifiers based on the
entropy principle. The general decision rules were derived for
both Bayesian and mutual-information classifiers based on
the given assumptions. Two specific theorems were derived
to reveal the intrinsic problems of Bayesian classifiers in
applications under the two scenarios. One theorem described
that Bayesian classifiers have a tendency to overlook the
misclassification error associated with a minority
class. This tendency degenerates a binary classification
into a single-class problem with meaningless solutions.
The other theorem revealed the parameter redundancy of
cost terms in abstaining classifications. This weakness lies not
only in reaching an inconsistent interpretation of cost terms.
The pivotal difficulty is in maintaining the objectivity of
cost terms. In real applications, information about cost terms
is rarely available. This is particularly true for reject types.
While Berger explained the demands for "objective Bayesian
analysis" [43], we need to recognize that this goal may fail
when cost terms are applied in classifications. In comparison,
mutual-information classifiers do not suffer from such difficulties.
Their advantage of not requiring cost terms will enable
current classifiers to process abstaining classifications,
like a new folder of "Suspected Mail" in spam filtering
[44]. Several numerical examples in this work supported
the unique benefits of using mutual-information classifiers in
special cases. The comparative study in this work was not
meant to replace Bayesian classifiers with mutual-information
classifiers. Bayesian and mutual-information classifiers can
form "complementary rather than competitive" (words from
Zadeh [45]) solutions to classification problems. However,
this work was intended to highlight their differences from
theoretical studies. More detailed discussions of the differences
between the two types of classifiers were given in Section IV.
As a final conclusion, a simple answer to the question in the title is
summarized below:

"Bayesian and mutual-information classifiers are essentially
different in their applied learning targets. From application
viewpoints, Bayesian classifiers are more suitable for
the cases when cost terms are exactly known for the trade-off of
error types and reject types. Mutual-information classifiers are
capable of objectively balancing error types and reject types
automatically without employing cost terms, even in the cases
of extremely class-imbalanced datasets, which may describe
a theoretical interpretation of why humans are more concerned
about the accuracy of rare classes in classifications."

TABLE VII

Results of Example 4 on Univariate Uniform Distributions

| T_r | Decision on R_i | E1, E2 | E | Rej1, Rej2 | Rej | NI |
|---|---|---|---|---|---|---|
| 1/3 < T_r < 2/3 | f(x ∈ R_i) = y1 | 0.0, 0.125 | 0.125 | 0, 0 | 0 | 0.549 |
| 2/3 < T_r < 1 | f(x ∈ R_i) = y2 | 0.250, 0 | 0.250 | 0, 0 | 0 | 0.311 |
| 0 < T_r < 1/3 | f(x ∈ R_i) = y3 | 0, 0 | 0 | 0.250, 0.125 | 0.375 | 0.656 |


Appendix A 
Proof of Theorem 1 

Proof: The decision rule of Bayesian classifiers for the
"no rejection" case is well known [4]. Hence, only the
rule for the "rejection" case is studied in the present proof.
Considering eq. (6a) first from (5a), a pattern $x$ is decided by
a Bayesian classifier to be $y_1$ if $risk(y_1|x) < risk(y_2|x)$ and
$risk(y_1|x) < risk(y_3|x)$. Substituting eqs. (1) and (2) into
these inequalities results in:

$$\text{Decide } y_1 \text{ if } \frac{p(x|t_1)p(t_1)}{p(x|t_2)p(t_2)} > \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} \ \text{ and } \ \frac{p(x|t_1)p(t_1)}{p(x|t_2)p(t_2)} > \frac{\lambda_{21} - \lambda_{23}}{\lambda_{13} - \lambda_{11}}. \tag{A1}$$

Similarly, one can obtain

$$\text{Decide } y_2 \text{ if } \frac{p(x|t_1)p(t_1)}{p(x|t_2)p(t_2)} < \frac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}} \ \text{ and } \ \frac{p(x|t_1)p(t_1)}{p(x|t_2)p(t_2)} < \frac{\lambda_{23} - \lambda_{22}}{\lambda_{12} - \lambda_{13}} \tag{A2}$$

and eq. (6c), respectively. Eq. (A1) describes that a single
upper bound among the two boundaries will control a pattern $x$
to be $y_1$. Similarly, eq. (A2) describes a lower bound for
a pattern $x$ to be $y_2$. From the constraints (3), one cannot
determine which boundaries will be the upper bound or the lower
bound. However, one can determine them from the following
two hints in classifications:

A. Eq. (6c) describes a single lower boundary and a single
upper boundary for a pattern $x$ to be $y_3$.

B. The upper bound in (A1) and the lower bound in (A2)
should be coincident with one of the boundaries in (6c),
respectively, so that the classification regions from $R_1$ to $R_3$
will cover the complete domain of the pattern $x$ (see Fig.
1c-d).

The hints above suggest the novel constraints for $\lambda_{ij}$ as
shown in eq. (6d). Any violation of the constraints will
introduce a new classification region $R_4$, which is not correct
for the present classification background. The constraints of
thresholds for rejection (6e) can be derived directly from (6c)
and (6d). ■
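As an illustration of the substitution step, the second inequality of (A1) can be worked out explicitly. The derivation below assumes that the conditional risk has the usual form $risk(y_j|x) = \sum_i \lambda_{ij}\, p(t_i|x)$, that $p(t_i|x) \propto p(x|t_i)\,p(t_i)$, and that $\lambda_{13} > \lambda_{11}$; these are assumptions of this sketch, since eqs. (1) and (2) are not restated here.

```latex
\begin{align*}
risk(y_1|x) < risk(y_3|x)
&\iff \lambda_{11}\,p(t_1|x) + \lambda_{21}\,p(t_2|x)
      < \lambda_{13}\,p(t_1|x) + \lambda_{23}\,p(t_2|x) \\
&\iff (\lambda_{21}-\lambda_{23})\,p(t_2|x)
      < (\lambda_{13}-\lambda_{11})\,p(t_1|x) \\
&\iff \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)}
      > \frac{\lambda_{21}-\lambda_{23}}{\lambda_{13}-\lambda_{11}} .
\end{align*}
```

The same steps applied to $risk(y_2|x) < risk(y_3|x)$, now assuming $\lambda_{12} > \lambda_{13}$, give the bound $(\lambda_{23}-\lambda_{22})/(\lambda_{12}-\lambda_{13})$ that appears in the rule for deciding $y_2$.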

Appendix B 

Tighter Bounds between Conditional Entropy and 
Bayesian Error in Binary Classifications 

In the study of relations between mutual information ($I$) and
Bayesian error ($E$), two important studies are reported on the
lower bound (LB) by Fano [48] and the upper bound (UB)
by Kovalevskij [49] in the forms of

$$\text{LB}: \quad E \ge \frac{H(T) - I(T,Y) - H(E)}{\log_2(m-1)} = \frac{H(T|Y) - H(E)}{\log_2(m-1)}, \tag{B1}$$

$$\text{UB}: \quad E \le \frac{H(T) - I(T,Y)}{2} = \frac{H(T|Y)}{2}, \tag{B2}$$

where $m$ is the total number of classes in $T$, $H(E)$ is the
binary Shannon entropy, and $H(T|Y)$ is called the conditional
entropy, which can be derived from a general relation [0]:

$$I(T,Y) = I(Y,T) = H(T) - H(T|Y) = H(Y) - H(Y|T). \tag{B3}$$

For binary classifications ($m = 2$), a tighter Fano's bound
in [50][51] is adopted. Based on the rationale of Bayesian
error, we suggest tighter upper and lower bounds in the
forms of:

$$\text{Modified LB}: \quad H(E) \ge H(T|Y), \ \text{ and } \ 0 \le E, \tag{B4}$$

$$\text{Modified UB}: \quad E \le \min\left(p(t_1),\ p(t_2),\ \frac{H(T|Y)}{2}\right). \tag{B5}$$

Fig. 5 shows the bounds in binary classifications, which
differ from the "$I(T,Y)$ vs. $E$" plots in [51]. Because of the
equivalent relations:

$$\max I(T,Y) \iff \min H(T|Y), \tag{B6}$$

the plots for $H(T|Y)$ are preferable, since they do not require
the information of $H(T)$. One is able to draw the lower-bound
curve from (B4), but unable to show its explicit form
for $E$. The fact that the bounds enclose an area suggests two
important properties of the relations. The first is due to
the approximations in the derivations of the bounds [48][49].
The second represents an intrinsic property of no "one-to-one"
relations between mutual information and accuracy in
classifications [10].

Triangles and circles shown in Fig. 5 represent the paired
data in Table V from Bayesian classifiers and mutual-information
classifiers, respectively. They clearly demonstrate
specific forms in their positions within the same pairs.
The circle position is either coincident with, or "up and/or left"
of, its counterpart. These forms are attributed to the different
directions of the driving force for the two types of classifiers. One is
for "min $E$" and the other for "min $H(T|Y)$".

Important findings are observed in relation to the bounds.
First, the triangles demonstrate Fano's bound in eq. (B4) to
be a very tight lower bound. Second, an upper bound
$E_{max}$ exists according to Theorem 3, which is tighter than the
constant one (= 0.5) in [50]. When $p_{min}$ decreases as shown
in Table V, the upper bound from the maximum Bayesian error
becomes closer to its associated data. Third, Fano's
lower bound is effective for all classifiers, including
mutual-information classifiers. However, the upper bounds, even the
constant one (= 0.5), become invalid for mutual-information
classifiers (see the data $E = 0.514$ in Table VI).

Fig. 5. The bounds between conditional entropy $H(T|Y)$ and Bayesian
error in binary classifications. Triangles and circles are the data in Table V
from Bayesian classifiers and mutual-information classifiers, respectively. An
upper bound from the maximum Bayesian error exists, say, $E_{max} = 0.2$ for
the filled triangle. (Plot annotation: driving force direction of
mutual-information classifiers.)

The observations above indicate the necessity of further
investigation into the upper bounds for better descriptions of
the relations. Much tighter upper bounds, if they are possible,
would be desirable for disclosing theoretical insights into the
relations between the two types of classifiers.
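The behavior of the bounds can be checked against the tabulated data. The sketch below reads (B4) as $H(E) \ge H(T|Y)$ and (B5) as $E \le \min(p(t_1), p(t_2), H(T|Y)/2)$, takes the Bayesian values from Table V, and, for the mutual-information point of Table VI, recovers $H(T|Y)$ from NI under the assumption NI = 1 - H(T|Y)/H(T). A small tolerance absorbs the rounding of the tabulated values. The function names are hypothetical.

```python
import math

def binary_entropy(e):
    """Binary Shannon entropy H(E) in bits."""
    if e <= 0.0 or e >= 1.0:
        return 0.0
    return -e * math.log2(e) - (1.0 - e) * math.log2(1.0 - e)

def check_bounds(p_min, e1, e2, h_cond, tol=2e-3):
    """Return (lower bound holds, upper bound holds) for one data point."""
    e = e1 + e2
    fano_ok = binary_entropy(e) >= h_cond - tol          # reading of (B4)
    upper_ok = e <= min(p_min, h_cond / 2.0) + tol        # reading of (B5)
    return fano_ok, upper_ok

# (p_min, E1, E2, H(T|Y)) for the first five Bayesian columns of Table V.
bayes_rows = [
    (0.5, 0.0793, 0.0793, 0.631),
    (1.0 / 3.0, 0.0594, 0.0856, 0.591),
    (0.2, 0.0362, 0.0759, 0.491),
    (0.1, 0.0161, 0.0539, 0.349),
    (0.01, 0.483e-3, 0.903e-2, 0.0756),
]
bayes_checks = [check_bounds(*row) for row in bayes_rows]

# Mutual-information classifier of Table VI without rejection (E = 0.514).
h_cond_mi = (1.0 - 0.0803) * binary_entropy(0.2)
mi_check = check_bounds(0.2, 0.499, 0.0153, h_cond_mi)

print(bayes_checks, mi_check)
```

Under these readings the Bayesian points satisfy both bounds, while the mutual-information point of Table VI satisfies only the lower bound, consistent with the third finding above.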